Putting Together Alignments & Comparing Assemblies Michael Brudno Department of Computer Science University of Toronto 6.095/6.895 - Computational Biology:

Slides:



Advertisements
Similar presentations
Algorithms for Alignment of Genomic Sequences Michael Brudno Department of Computer Science Stanford University PGA Workshop 07/16/2004.
Advertisements

Bioinformatics Chromosome rearrangements Chromosome and genome comparison versus gene comparison Permutations and breakpoint graphs Transforming Men into.
DNA Sequencing.
Genomic Sequence Alignment. Overview Dynamic programming & the Needleman-Wunsch algorithm Local alignment—BLAST Fast global alignment Multiple sequence.
CS273a Lecture 4, Autumn 08, Batzoglou Some Terminology insert a fragment that was incorporated in a circular genome, and can be copied (cloned) vector.
DNA Sequencing Lecture 9, Tuesday April 29, 2003.
CS262 Lecture 9, Win07, Batzoglou Fragment Assembly Given N reads… Where N ~ 30 million… We need to use a linear-time algorithm.
Evolution at the DNA level …ACGGTGCAGTTACCA… …AC----CAGTCCACCA… Mutation SEQUENCE EDITS REARRANGEMENTS Deletion Inversion Translocation Duplication.
CS273a Lecture 8, Win07, Batzoglou Evolution at the DNA level …ACGGTGCAGTTACCA… …AC----CAGTCCACCA… Mutation SEQUENCE EDITS REARRANGEMENTS Deletion Inversion.
Some new sequencing technologies. Molecular Inversion Probes.
1 Protein Multiple Alignment by Konstantin Davydov.
Class 02: Whole genome sequencing. The seminal papers ``Is Whole Genome Sequencing Feasible?'' ``Whole-Genome DNA.
CS262 Lecture 11, Win07, Batzoglou Some Terminology insert a fragment that was incorporated in a circular genome, and can be copied (cloned) vector the.
Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.
DNA Sequencing. The Walking Method 1.Build a very redundant library of BACs with sequenced clone- ends (cheap to build) 2.Sequence some “seed” clones.
DNA Sequencing Some Terminology insert a fragment that was incorporated in a circular genome, and can be copied (cloned) vector the circular genome (host)
Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.
Assembly.
DNA Sequencing and Assembly
DNA Sequencing.
Alignments and Comparative Genomics. Welcome to CS374! Today: Serafim: Alignments and Comparative Genomics Omkar: Administrivia.
CS273a Lecture 4, Autumn 08, Batzoglou Fragment Assembly (in whole-genome shotgun sequencing) CS273a Lecture 5.
Sequencing and Assembly Cont’d. CS273a Lecture 5, Win07, Batzoglou Steps to Assemble a Genome 1. Find overlapping reads 4. Derive consensus sequence..ACGATTACAATAGGTT..
Multiple Sequence Alignments. Lecture 12, Tuesday May 13, 2003 Reading Durbin’s book: Chapter Gusfield’s book: Chapter 14.1, 14.2, 14.5,
CS273a Lecture 9/10, Aut 10, Batzoglou Multiple Sequence Alignment.
Sequencing and Assembly Cont’d. CS273a Lecture 5, Aut08, Batzoglou Steps to Assemble a Genome 1. Find overlapping reads 4. Derive consensus sequence..ACGATTACAATAGGTT..
DNA Sequencing. CS273a Lecture 3, Spring 07, Batzoglou Steps to Assemble a Genome 1. Find overlapping reads 4. Derive consensus sequence..ACGATTACAATAGGTT..
CS273a Lecture 4, Autumn 08, Batzoglou Hierarchical Sequencing.
DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA.
DNA Sequencing. CS262 Lecture 9, Win06, Batzoglou DNA Sequencing – gel electrophoresis 1.Start at primer(restriction site) 2.Grow DNA chain 3.Include.
CS273a Lecture 2, Autumn 10, Batzoglou DNA Sequencing (cont.)
CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics.
CS262 Lecture 9, Win07, Batzoglou Conditional Random Fields A brief description.
Short Primer on Comparative Genomics Today: Special guest lecture 12pm, Alway M108 Comparative genomics of animals and plants Adam Siepel Assistant Professor.
Genome sequencing and assembling
Phylogenetic Tree Construction and Related Problems Bioinformatics.
Genome sequencing. Vocabulary Bac: Bacterial Artificial Chromosome: cloning vector for yeast Pac, cosmid, fosmid, plasmid: cloning vectors for E. coli.
Sequencing a genome and Basic Sequence Alignment
CS 6030 – Bioinformatics Summer II 2012 Jason Eric Johnson
Multiple Sequence Alignment CSC391/691 Bioinformatics Spring 2004 Fetrow/Burg/Miller (Slides by J. Burg)
BIOMETRICS Module Code: CA641 Week 11- Pairwise Sequence Alignment.
Assembling Genomes BCH364C/391L Systems Biology / Bioinformatics – Spring 2015 Edward Marcotte, Univ of Texas at Austin Edward Marcotte/Univ. of Texas/BCH364C-391L/Spring.
De-novo Assembly Day 4.
CS 394C March 19, 2012 Tandy Warnow.
What is comparative genomics? Analyzing & comparing genetic material from different species to study evolution, gene function, and inherited disease Understand.
PE-Assembler: De novo assembler using short paired-end reads Pramila Nuwantha Ariyaratne.
Multiple Sequence Alignment. Definition Given N sequences x 1, x 2,…, x N :  Insert gaps (-) in each sequence x i, such that All sequences have the.
CS273a Lecture 4, Autumn 08, Batzoglou CS273a 2011 DNA Sequencing.
Sequence assembly using paired- end short tags Pramila Ariyaratne Genome Institute of Singapore SOC-FOS-SICS Joint Workshop on Computational Analysis of.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Genome sequencing Haixu Tang School of Informatics.
A Sequenciação em Análises Clínicas Polymerase Chain Reaction.
Sequencing a genome and Basic Sequence Alignment
Bioinformatic Tools for Comparative Genomics of Vectors Comparative Genomics.
Human Genome.
COT 6930 HPC and Bioinformatics Multiple Sequence Alignment Xingquan Zhu Dept. of Computer Science and Engineering.
DNA, RNA and protein are an alien language
1 MAVID: Constrained Ancestral Alignment of Multiple Sequence Author: Nicholas Bray and Lior Pachter.
Mojavensis: Issues of Polymorphisms Chris Shaffer GEP 2009 Washington University.
Whole Genome Sequencing (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 13, 2005 ChengXiang Zhai Department of Computer Science University of.
COMPUTATIONAL GENOMICS GENOME ASSEMBLY
ALLPATHS: De Novo Assembly of Whole-Genome Shotgun Microreads
Genome Research 12:1 (2002), Assembly algorithm outline ● Input and trimming ● Overlap detection ● Error correction ● Evaluation of alignments.
DNA Sequencing Project
Fragment Assembly (in whole-genome shotgun sequencing)
Genome sequence assembly
The ideal approach is simultaneous alignment and tree estimation.
Genome Projects Maps Human Genome Mapping Human Genome Sequencing
CSCI 1810 Computational Molecular Biology 2018
Assembling Genomes BCH339N Systems Biology / Bioinformatics – Spring 2016 Edward Marcotte, Univ of Texas at Austin.
Presentation transcript:

Putting Together Alignments & Comparing Assemblies Michael Brudno Department of Computer Science University of Toronto 6.095/ Computational Biology: Genomes, Networks, Evolution Lecture 23 – Guest LectureDec 1, 2005

Overview Intro to Assembly –Overlap-Layout-Consensus –String graph method for assembly Intro to Alignments –Global Alignment (LAGAN) –Glocal alignment (Rearrangements) Putting it Together

The Human Genome ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCT CCGGGGCCACGGCCACCGCTGCCCTGCCCCTGGAGGGTGG CCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCA GGAATAAGGAAAAGCAGCTCCTGACTTTCCTCGCTTGGTGGT TTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGA GAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCAC CCCCCCAGCAATCCGCGCGCCGGGACAGAATGCCCTGCAG GAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTC ACCCATGAATGCTCACGCAAGTTTAATTACAGACCTGAACTC CTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCC AGTGCCGGGCCCCTCATAGGAGAGGAAGCTCGGGAGGTGG CCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCG CCGGGACAGAATGCCTGCAGGAACTTCTTCTGGAAGACCTTC TCCTCCTGCAAATAAAACCTCACCCATGGGAATGCTCACGCA TTTAATTACAGACCTGAAAGGAGAGGAAGCTCGGGAGGTGG

Whole Genome Shotgun Sequencing cut many times at random genome forward-reverse paired reads plasmids (2 – 10 Kbp) cosmids (40 Kbp) known dist ~500 bp

Overview Intro to Assembly –Overlap-Layout-Consensus –String graph method for assembly Intro to Alignments –Global Alignment (LAGAN) –Glocal alignment (Rearrangements) Putting it Together

Fragment Assembly Section “borrowed” from Serafim Batzoglou

Steps to Assemble a Genome 1. Find overlapping reads 4. Derive consensus sequence..ACGATTACAATAGGTT.. 2. Merge some “good” pairs of reads into longer contigs 3. Link contigs to form supercontigs Some Terminology read a long word that comes out of sequencer mate pair a pair of reads from two ends of the same insert fragment contig a contiguous sequence formed by several overlapping reads with no gaps supercontig an ordered and oriented set (scaffold) of contigs, usually by mate pairs consensus sequence derived from the sequene multiple alignment of reads in a contig

1. Find Overlapping Reads Sort all k-mers in reads (k = 24) TAGATTACACAGATTAC ||||||||||||||||| Find pairs of reads sharing a k-mer Extend to full alignment – throw away if not >97% similar T GA TAGA | || TACA TAGT ||

2. Merge Reads into Contigs Merge reads up to potential repeat boundaries repeat region Unique Contig Overcollapsed Contig

2. Merge Reads into Contigs Overlap graph: –Nodes: reads r 1 …..r n –Edges: overlaps (r i, r j, shift, orientation, score) Remove transitively inferrable overlaps

Overlap graph after forming contigs

Repeats, errors, and contig lengths Repeats shorter than read length are OK Repeats with more base pair diffs than sequencing error rate are OK To make the genome appear less repetitive, try to: –Increase read length –Decrease sequencing error rate Role of error correction: Discards ~90% of single-letter sequencing errors decreases error rate  decreases effective repeat content  increases contig length

4. Derive Consensus Sequence Derive multiple alignment from pairwise read alignments TAGATTACACAGATTACTGA TTGATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAAACTA TAG TTACACAGATTATTGACTTCATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGGGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAA CTA Derive each consensus base by weighted voting (Alternative: take maximum-quality letter)

Some Assemblers PHRAP Early assembler, widely used, good model of read errors Overlap O(n 2 ) -- layout (no mate pairs) -- consensus Celera First assembler to handle large genomes (fly, human, mouse) Overlap – layout -- consensus Arachne Public assembler (mouse, several fungi) Overlap – layout -- consensus Euler Indexing -- deBruijn graph -- picking paths -- consensus

Overview Intro to Assembly –Overlap-Layout-Consensus –String graph method for assembly Intro to Alignments –Global Alignment (LAGAN) –Glocal alignment (Rearrangements) Putting it Together

String Graph Concept Given a shotgun dataset of reads we should be able to build a graph that looks like this: x 1 There are two possible tours: Myers 2005

How To Build A String Graph A B  Remove Transitive Overlaps O(E) expected-time alg. AB B-A Junction  Collapse Chains CompressedEdge

Orientation: Bi-directed Graphs  DNA can be read in 2 directions  Reads can be used in either direction  Junction points are directed  An edge can be used in both directions

Edge Labels Estimate the arrival rate of fragments & size of genome (look at all edges over 10Kbp long (almost all are unique)) Classify edges as follows: –=1: Probability edge is not unique < e -18 Celera A-statistic f interior pts. –  1: Has an interior vertex –  0: Otherwise.

Reasoning About “Flows” ac b x y z =1=1 ≥ 0 ≥ 1 =1 ≥ 0 Want a+b+c = x+y+z = 0 = 1 ≥ 2 Brudno, Davidson, Myers 200?

Real Data Has Errors  Reads from multiple places in the genome (chimers)  Some overlaps are missed due to errors and polymorphisms

Error Correction Algorithm  Build local alignments between all read pairs We use a very fast O(N+d 2 ) algorithm  Fix parts of reads (indels, mutations) that are not supported by any read and are contradicted by at least 2  Some errors are impossible to fix

Achieve a Feasible Flow  Remove fewest number of reads: add back-edges Penalty for back-edge equal to number of reads  Edge + back edge form a cycle: edge eliminated

C. jejuni Genome  1.7 Mb; 24,000 reads  Initial graph: 129 nodes, 174 edges  After Flow solving (< 3 minutes total run time): 22 nodes 35 edges 4 edges (5 reads) rejected

Iterating Flow Solving  On larger genomes there may not be a unique min cost flow  We can iterate flow solving:  Add  penalty to all edges in solution  Solve flow again – if there is an alternate min cost flow it will now be smaller  Repeat until no new edges  Edges are labeled - Required In all solutions - UnreliableIn some solutions - Unneeded In no solutions

S. bayanus genome  11.5 Mb genome; 6.4X coverage  Initial graph: 3367 edges 804 =1; 1589  1; 1698  0  After Flow solving (9 iterations): Of the 1698 edges: Of the 1698 edges: 1047 eliminated; 204 required; 447 unreliable 17 edges rejected: 8 Bubbles9 Splinters  Total running time for S. bayanus < 10 minutes

Future Work Use the mate pairs to build path  Separate repeats  Build multi-alignments for edges

Overview Intro to Assembly –Overlap-Layout-Consensus –String graph method for assembly Intro to Alignments –Global Alignment (LAGAN) –Glocal alignment (Rearrangements) Putting it Together

The Human Genome ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCT CCGGGGCCACGGCCACCGCTGCCCTGCCCCTGGAGGGTGG CCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCA GGAATAAGGAAAAGCAGCTCCTGACTTTCCTCGCTTGGTGGT TTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGA GAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCAC CCCCCCAGCAATCCGCGCGCCGGGACAGAATGCCCTGCAG GAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTC ACCCATGAATGCTCACGCAAGTTTAATTACAGACCTGAACTC CTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCC AGTGCCGGGCCCCTCATAGGAGAGGAAGCTCGGGAGGTGG CCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCG CCGGGACAGAATGCCTGCAGGAACTTCTTCTGGAAGACCTTC TCCTCCTGCAAATAAAACCTCACCCATGGGAATGCTCACGCA TTTAATTACAGACCTGAAAGGAGAGGAAGCTCGGGAGGTGG

Basic Biology DNA (4 residues, Double-stranded) RNA (4 residues, Single-stranded) Protein (20 amino acids) –A.a. code: triplet of RNA codes 1 amino acid UTR exon gene exon UTR exon UTR exon E P ATG

The Human Genome ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCT CCGGGGCCACGGCCACCGCTGCCCTGCCCCTGGAGGGTGG CCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCA GGAATAAGGAAAAGCAGCTCCTGACTTTCCTCGCTTGGTGGT TTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGA GAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCAC CCCCCCAGCAATCCGCGCGCCGGGACAGAATGCCCTGCAG GAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTC ACCCATGAATGCTCACGCAAGTTTAATTACAGACCTGAACTC CTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCC AGTGCCGGGCCCCTCATAGGAGAGGAAGCTCGGGAGGTGG CCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCG CCGGGACAGAATGCCTGCAGGAACTTCTTCTGGAAGACCTTC TCCTCCTGCAAATAAAACCTCACCCATGGGAATGCTCACGCA TTTAATTACAGACCTGAAAGGAGAGGAAGCTCGGGAGGTGG

Complete DNA Sequences nearly 200 complete genomes have been sequenced

Complete DNA Sequences ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGC CACGGCCACCGCTGCCCTGCCCCTGGAGGGTGGCCCCACCGGCCGAGA CAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGCTCCTGA CTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCC CCTCATAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGC ACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCCCTGCAGGAACTTC TTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCA CGCAAGTTTAATTACAGACCTGAACTCCTGACTTTCCTCGCTTGGTGGTTT GAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAAGGAAGCT CGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCG CGCCGGGACAGAATGCCTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCC TGCAAATAAAACCTCACCCATGGGAATGCTCACGCATTTAATTACAGACCT GAAAGGAGAGGAAGCTACAGTCATGTGCFCGGGAGGTGGGCATCTGACA ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCC ACGGCCACCGCTGCCCTGCCCCTGGAGGGTGGCCCCACCGGCCGAGACA GCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGCTCCTGACTT TCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTC ATAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCC CCCAGCAATCCGCGCGCCGGGACAGAATGCCCTGCAGGAACTTCTTCTGG AAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAGT TTAATTACAGACCTGAACTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGAC CTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGGAAGCTCGGGAGGTG GCATGTGACCTCCGAGCAGTCACCADCCAGGCGGCAGGAAGGCGCACCC CCCCAGCAATCCGCGCGCCGGGACAGAATGCCTGCAGGAACTTCTTCTGG AAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGGGAATGCTCACGCA TTTAATTACAGACCTGAAAGGAGAGGAAGCTCGGGAGGTGGGCATCTGACA ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGC CACGGCCACCGCTGCCCTGCCCCTGGAGGGTGGCCCCACCGGCCGAGA CAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGCTCCTGA CTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCC CCTCATAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGC ACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCCCTGCAGGAACTTC TTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCA CGCAAGTTTAATTACAGACCTGAACTCCTGACTTTCCTCGCTTGGTGGTTT GAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAAGGAAGCT CGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCG CGCCGGGACAGAATGCCTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCC TGCAAATAAAACCTCACCCATGGGAATGCTCACGCATTTAATTACAGACCT GAAAGGAGAGGAAGCTACAGTCATGTGCFCGGGAGGTGGGCATCTGACA ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCC ACGGCCACCGCTGCCCTGCCCCTGGAGGGTGGCCCCACCGGCCGAGACA GCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGCTCCTGACTT TCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTC ATAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCC CCCAGCAATCCGCGCGCCGGGACAGAATGCCCTGCAGGAACTTCTTCTGG AAGACCTCCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAG TTTAATTACAGACCTGAACTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGA CCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGGAAGCTCGGGAGGT GGCATGTGACCTCCGAGCAGTCACCADCCAGGCGGCAGGAAGGCGCACC CCCCCAGCAATCCGCGCGCCGGGACAGAATGCCTGCAGGAACTTCTTCTG GAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGGGAATGCTCACGC ATTTAATTACAGACCTGAAAGGAGAGGAAGCTCGGGAGGTGGGCATCTGAC ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGC CACGGCCACCGCTGCCCTGCCCCTGGAGGGTGGCCCCACCGGCCGAGA CAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGCTCCTGA CTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCC CCTCATAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGC ACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCCCTGCAGGAACTTC TTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCA CGCAAGTTTAATTACAGACCTGAACTCCTGACTTTCCTCGCTTGGTGGTTT GAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAAGGAAGCT CGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCG CGCCGGGACAGAATGCCTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCC TGCAAATAAAACCTCACCCATGGGAATGCTCACGCATTTAATTACAGACCT GAAAGGAGAGGAAGCTACAGTCATGTGCFCGGGAGGTGGGCATCTGACA

Evolution

Conservation Implies Function Exon Gene CNS: Other Conserved Dubchak, Brudno et al 2000

Overview Intro to Assembly –Overlap-Layout-Consensus –String graph method for assembly Intro to Alignments –Global Alignment (LAGAN) –Glocal alignment (Rearrangements) Putting it Together

Edit Distance Model (1) Weighted sum of insertions, deletions & mutations to transform one string into another AGGCACA--CA AGGCACACA | |||| || or | || || A--CACATTCA ACACATTCA Levenshtein 1966

Edit Distance Model (2) Given:x, y Define:F(i,j) = Score of best alignment of x 1 …x i to y 1 …y j Recurrence:F(i,j) = max (F(i-1,j) – GAPPENALTY, F(i,j-1) – GAPPENALTY, F(i-1,j-1) + SCORE(x i, y j )) F(i,j) F(i,j-1) 7 F(i-1,j) 6 F(i-1,j-1) 5 5 A T Gappenalty = 2 Score(A,T) = -1

Edit Distance Model (3) F(i,j) = Score of best alignment ending at i,j Time O( n 2 ) for two seqs,  ( n k ) for k seqs F(i,j-1) F(i,j) F(i-1,j) F(i,j-1) AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC Needleman & Wunsch 1970

Global Alignment x y z AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

0% 50%100% The Theory

LAGAN: 1. FIND Local Alignments 1.Find Local Alignments 2.Chain Local Alignments 3.Restricted DP Brudno, Do 2003

LAGAN: 2. CHAIN Local Alignments 1.Find Local Alignments 2.Chain Local Alignments 3.Restricted DP

LAGAN: 3. Restricted DP 1.Find Local Alignments 2.Chain Local Alignments 3.Restricted DP

MLAGAN: 1. Progressive Alignment Given N sequences, phylogenetic tree Align pairwise, in order of the tree (LAGAN) Human Baboon Mouse Rat

MLAGAN: 2. Multi-anchoring X Z Y Z X/Y Z To anchor the (X/Y), and (Z) alignments:

Cystic Fibrosis (CFTR), 12 species Human sequence length: 1.8 Mb Total genomic sequence: 13 Mb Human Baboon Cat Dog Cow Pig Mouse Rat Chimp Chicken Fugufish Zebrafish

CFTR (cont’d ) % Mammals AVID % Chicken & Fishes % Mammals LAGAN % Chicken & Fishes Mammals Chicken & Fishes Mammals % % MLAGAN 98% % BLASTZ MAX MEMORY (Mb) TIME (sec) % Exons Aligned

Overview Intro to Assembly –Overlap-Layout-Consensus –String graph method for assembly Intro to Alignments –Global Alignment (LAGAN) –Glocal alignment (Rearrangements) Putting it Together

Local & Global Alignment AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA Local Global

Glocal Alignment Problem Find least cost transformation of one sequence into another using new operations Sequence edits Inversions Translocations Duplications Combinations of above AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC

S-LAGAN: Find Local Alignments 1.Find Local Alignments 2.Build Rough Homology Map 3.Globally Align Consistent Parts

S-LAGAN: Build Homology Map 1.Find Local Alignments 2.Build Rough Homology Map 3.Globally Align Consistent Parts

Building the Homology Map Chain (using Eppstein Galil); each alignment gets a score which is MAX over 4 possible chains. Penalties are affine (event and distance components) Penalties: a)regular b)translocation c)inversion d)inverted translocation a b c d

S-LAGAN: Build Homology Map 1.Find Local Alignments 2.Build Rough Homology Map 3.Globally Align Consistent Parts

S-LAGAN: Global Alignment 1.Find Local Alignments 2.Build Rough Homology Map 3.Globally Align Consistent Parts

S-LAGAN Results (CFTR) LocalLocal GlocalGlocal

Hum/MusHum/Mus H u m / R at

S-LAGAN results (HOX) 12 paralogous genes Conserved order in mammals

S-LAGAN results (HOX) 12 paralogous genes Conserved order in mammals

S-LAGAN results (IGF cluster)

Handling Chromosomes & Symmetry Problems: –S-LAGAN is meant to run on two sequences –S-LAGAN is not symmetric (it has a base genome) Solutions: –Switch penalty –Super-monotonic maps Sundararajan, Brudno 2004

Handling Chromosomes: Switch Penalty Switch Penalty Chr 3Chr 2Chr 1Chr 4 Base chromosome

Problems with Non-symmetry Duplications are only caught in the base sequence

Problems with Non-symmetry Translocations lead to different alignments, and include non-hologous sequences Brudno, Kislyuk 200?

Supermap Algorithm Build 1-monotonic maps with both base genomes (cyan & pink) Duplication Inversion Translocation

Supermap Algorithm Build 1-monotonic maps with both base genomes (cyan & pink) Duplication Inversion Translocation

Supermap Algorithm Build 1-monotonic maps with both base genomes (cyan & pink) Whenever the maps agree, join them (blue) Duplication Inversion Translocation

Supermap Algorithm Build 1-monotonic maps with both base genomes (cyan & pink) Whenever the maps agree, join them (blue) Syntenic areas start wherever paths split Duplication Inversion Translocation

Human & Mouse Rearrangement Map

Human Genome Alignment Results Compared with the previous tandem local/global approach: 2-fold speedup Sensitivity of exon alignment unchanged in human/mouse, improved in human/chicken 9-fold reduction in the number of mapped syntenic segments in human/mouse. Coverage in 2 nd species slightly higher

Overview Intro to Assembly –Overlap-Layout-Consensus –String graph method for assembly Intro to Alignments –Global Alignment (LAGAN) –Glocal alignment (Rearrangements) Putting it Together

Acknowledgments Stanford: Serafim Batzoglou Arend Sidow Kerrin Small Chuong (Tom) Do Mukund Sundararajan Lawrence Berkeley Lab: Inna Dubchak Alexander Poliakov Andrey Kislyuk HHMI- Janelia: Gene Myers Stuart Davidson Thank You!