CS262 Lecture 9, Win07, Batzoglou Fragment Assembly Given N reads… Where N ~ 30 million… We need to use a linear-time algorithm.

CS262 Lecture 9, Win07, Batzoglou Fragment Assembly Given N reads… Where N ~ 30 million… We need to use a linear-time algorithm

CS262 Lecture 9, Win07, Batzoglou Steps to Assemble a Genome 1. Find overlapping reads 4. Derive consensus sequence..ACGATTACAATAGGTT.. 2. Merge some “good” pairs of reads into longer contigs 3. Link contigs to form supercontigs Some Terminology read a 500-900 long word that comes out of sequencer mate pair a pair of reads from two ends of the same insert fragment contig a contiguous sequence formed by several overlapping reads with no gaps supercontig an ordered and oriented set (scaffold) of contigs, usually by mate pairs consensus sequence derived from the sequene multiple alignment of reads in a contig

CS262 Lecture 9, Win07, Batzoglou 1. Find Overlapping Reads Find pairs of reads sharing a k-mer, k ~ 24 Extend to full alignment – throw away if not >98% similar TAGATTACACAGATTAC ||||||||||||||||| T GA TAGA | || TACA TAGT || Caveat: repeats  A k-mer that occurs N times, causes O(N 2 ) read/read comparisons  ALU k-mers could cause up to 1,000,000 2 comparisons Solution:  Discard all k-mers that occur “ too often ” Set cutoff to balance sensitivity/speed tradeoff, according to genome at hand and computing resources available

CS262 Lecture 9, Win07, Batzoglou 1. Find Overlapping Reads Correct errors using multiple alignment TAGATTACACAGATTACTGA TAGATTACACAGATTATTGA TAGATTACACAGATTACTGA TAG-TTACACAGATTACTGA TAGATTACACAGATTACTGA TAG-TTACACAGATTATTGA TAGATTACACAGATTACTGA TAG-TTACACAGATTATTGA insert A replace T with C correlated errors— probably caused by repeats  disentangle overlaps TAGATTACACAGATTACTGA TAG-TTACACAGATTATTGA TAGATTACACAGATTACTGA TAG-TTACACAGATTATTGA In practice, error correction removes up to 98% of the errors

CS262 Lecture 9, Win07, Batzoglou 2. Merge Reads into Contigs Overlap graph:  Nodes: reads r 1 …..r n  Edges: overlaps (r i, r j, shift, orientation, score) Note: of course, we don’t know the “color” of these nodes Reads that come from two regions of the genome (blue and red) that contain the same repeat

CS262 Lecture 9, Win07, Batzoglou 2. Merge Reads into Contigs

CS262 Lecture 9, Win07, Batzoglou Overlap graph after forming contigs Unitigs: Gene Myers, 95

CS262 Lecture 9, Win07, Batzoglou Repeats, errors, and contig lengths Repeats shorter than read length are easily resolved  Read that spans across a repeat disambiguates order of flanking regions Repeats with more base pair diffs than sequencing error rate are OK  We throw overlaps between two reads in different copies of the repeat To make the genome appear less repetitive, try to:  Increase read length  Decrease sequencing error rate Role of error correction: Discards up to 98% of single-letter sequencing errors decreases error rate  decreases effective repeat content  increases contig length

CS262 Lecture 9, Win07, Batzoglou 3. Link Contigs into Supercontigs Too dense  Overcollapsed Inconsistent links  Overcollapsed? Normal density

CS262 Lecture 9, Win07, Batzoglou Find all links between unique contigs 3. Link Contigs into Supercontigs Connect contigs incrementally, if  2 links supercontig (aka scaffold)

CS262 Lecture 9, Win07, Batzoglou Fill gaps in supercontigs with paths of repeat contigs 3. Link Contigs into Supercontigs

CS262 Lecture 9, Win07, Batzoglou 4. Derive Consensus Sequence Derive multiple alignment from pairwise read alignments TAGATTACACAGATTACTGA TTGATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAAACTA TAG TTACACAGATTATTGACTTCATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGGGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAA CTA Derive each consensus base by weighted voting (Alternative: take maximum-quality letter)

CS262 Lecture 9, Win07, Batzoglou Some Assemblers PHRAP Early assembler, widely used, good model of read errors Overlap O(n 2 )  layout (no mate pairs)  consensus Celera First assembler to handle large genomes (fly, human, mouse) Overlap  layout  consensus Arachne Public assembler (mouse, several fungi) Overlap  layout  consensus Phusion Overlap  clustering  PHRAP  assemblage  consensus Euler Indexing  Euler graph  layout by picking paths  consensus

CS262 Lecture 9, Win07, Batzoglou Quality of assemblies Celera’s assemblies of human and mouse

CS262 Lecture 9, Win07, Batzoglou Quality of assemblies—mouse

CS262 Lecture 9, Win07, Batzoglou Quality of assemblies—mouse N50 contig length Terminology: N50 contig length If we sort contigs from largest to smallest, and start Covering the genome in that order, N50 is the length Of the contig that just covers the 50 th percentile.

CS262 Lecture 9, Win07, Batzoglou Quality of assemblies—rat

CS262 Lecture 9, Win07, Batzoglou History of WGA 1982: -virus, 48,502 bp 1995: h-influenzae, 1 Mbp 2000: fly, 100 Mbp 2001 – present  human (3Gbp), mouse (2.5Gbp), rat *, chicken, dog, chimpanzee, several fungal genomes Gene Myers Let’s sequence the human genome with the shotgun strategy That is impossible, and a bad idea anyway Phil Green 1997

CS262 Lecture 9, Win07, Batzoglou Genomes Sequenced http://www.genome.gov/10002154

CS262 Lecture 9, Win07, Batzoglou Multiple Sequence Alignments

CS262 Lecture 9, Win07, Batzoglou Evolution at the DNA level …ACGGTGCAGTTACCA… …AC----CAGTCCACCA… Mutation SEQUENCE EDITS REARRANGEMENTS Deletion Inversion Translocation Duplication

CS262 Lecture 9, Win07, Batzoglou Protein Phylogenies Proteins evolve by both duplication and species divergence

CS262 Lecture 9, Win07, Batzoglou Orthology and Paralogy HB Human WB Worm HA1 Human HA2 Human Yeast WA Worm Orthologs: Derived by speciation Paralogs: Everything else Orthologs: Derived by speciation Paralogs: Everything else

CS262 Lecture 9, Win07, Batzoglou Orthology, Paralogy, Inparalogs, Outparalogs

CS262 Lecture 9, Win07, Batzoglou

Definition Given N sequences x 1, x 2,…, x N :  Insert gaps (-) in each sequence x i, such that All sequences have the same length L Score of the global map is maximum A faint similarity between two sequences becomes significant if present in many Multiple alignments reveal elements that are conserved among a class of organisms and therefore important in their common biology The patterns of conservation can help us tell function of the element

CS262 Lecture 9, Win07, Batzoglou Scoring Function: Sum Of Pairs Definition: Induced pairwise alignment A pairwise alignment induced by the multiple alignment Example: x:AC-GCGG-C y:AC-GC-GAG z:GCCGC-GAG Induces: x: ACGCGG-C; x: AC-GCGG-C; y: AC-GCGAG y: ACGC-GAC; z: GCCGC-GAG; z: GCCGCGAG

CS262 Lecture 9, Win07, Batzoglou Sum Of Pairs (cont’d) Heuristic way to incorporate evolution tree: Human Mouse Chicken Weighted SOP: S(m) =  k<l w kl s(m k, m l ) Duck

CS262 Lecture 9, Win07, Batzoglou A Profile Representation Given a multiple alignment M = m 1 …m n  Replace each column m i with profile entry p i Frequency of each letter in  # gaps Optional: # gap openings, extensions, closings  Can think of this as a “likelihood” of each letter in each position - A G G C T A T C A C C T G T A G – C T A C C A - - - G C A G – C T A C C A - - - G C A G – C T A T C A C – G G C A G – C T A T C G C – G G A 1 1.8 C.6 1.4 1.6.2 G 1.2.2.4 1 T.2 1.6.2 -.2.8.4.8.4

CS262 Lecture 9, Win07, Batzoglou Multiple Sequence Alignments Algorithms

CS262 Lecture 9, Win07, Batzoglou Multidimensional DP Generalization of Needleman-Wunsh: S(m) =  i S(m i ) (sum of column scores) F(i 1,i 2,…,i N ): Optimal alignment up to (i 1, …, i N ) F(i 1,i 2,…,i N )= max (all neighbors of cube) (F(nbr)+S(nbr))

CS262 Lecture 9, Win07, Batzoglou Example: in 3D (three sequences): 7 neighbors/cell F(i,j,k) = max{ F(i – 1, j – 1, k – 1) + S(x i, x j, x k ), F(i – 1, j – 1, k ) + S(x i, x j, - ), F(i – 1, j, k – 1) + S(x i, -, x k ), F(i – 1, j, k ) + S(x i, -, - ), F(i, j – 1, k – 1) + S( -, x j, x k ), F(i, j – 1, k ) + S( -, x j, - ), F(i, j, k – 1) + S( -, -, x k ) } Multidimensional DP

CS262 Lecture 9, Win07, Batzoglou Running Time: 1.Size of matrix:L N ; Where L = length of each sequence N = number of sequences 2.Neighbors/cell: 2 N – 1 Therefore………………………… O(2 N L N ) Multidimensional DP

CS262 Lecture 9, Win07, Batzoglou Running Time: 1.Size of matrix:L N ; Where L = length of each sequence N = number of sequences 2.Neighbors/cell: 2 N – 1 Therefore………………………… O(2 N L N ) Multidimensional DP How do gap states generalize? VERY badly!  Require 2 N – 1 states, one per combination of gapped/ungapped sequences  Running time: O(2 N  2 N  L N ) = O(4 N L N ) XYXYZZ YYZ XXZ

CS262 Lecture 9, Win07, Batzoglou Progressive Alignment When evolutionary tree is known:  Align closest first, in the order of the tree  In each step, align two sequences x, y, or profiles p x, p y, to generate a new alignment with associated profile p result Weighted version:  Tree edges have weights, proportional to the divergence in that edge  New profile is a weighted average of two old profiles x w y z p xy p zw p xyzw

CS262 Lecture 9, Win07, Batzoglou Progressive Alignment When evolutionary tree is known:  Align closest first, in the order of the tree  In each step, align two sequences x, y, or profiles p x, p y, to generate a new alignment with associated profile p result Weighted version:  Tree edges have weights, proportional to the divergence in that edge  New profile is a weighted average of two old profiles x w y z Example Profile: (A, C, G, T, -) p x = (0.8, 0.2, 0, 0, 0) p y = (0.6, 0, 0, 0, 0.4) s(p x, p y ) = 0.8*0.6*s(A, A) + 0.2*0.6*s(C, A) + 0.8*0.4*s(A, -) + 0.2*0.4*s(C, -) Result: p xy = (0.7, 0.1, 0, 0, 0.2) s(p x, -) = 0.8*1.0*s(A, -) + 0.2*1.0*s(C, -) Result: p x- = (0.4, 0.1, 0, 0, 0.5)

CS262 Lecture 9, Win07, Batzoglou Progressive Alignment When evolutionary tree is unknown:  Perform all pairwise alignments  Define distance matrix D, where D(x, y) is a measure of evolutionary distance, based on pairwise alignment  Construct a tree (UPGMA / Neighbor Joining / Other methods)  Align on the tree x w y z ?

CS262 Lecture 9, Win07, Batzoglou Heuristics to improve alignments Iterative refinement schemes A*-based search Consistency Simulated Annealing …

CS262 Lecture 9, Win07, Batzoglou Iterative Refinement One problem of progressive alignment: Initial alignments are “frozen” even when new evidence comes Example: x:GAAGTT y:GAC-TT z:GAACTG w:GTACTG Frozen! Now clear correct y = GA-CTT

CS262 Lecture 9, Win07, Batzoglou Iterative Refinement Algorithm (Barton-Stenberg): 1.For j = 1 to N, Remove x j, and realign to x 1 …x j-1 x j+1 …x N 2.Repeat 4 until convergence x y z x,z fixed projection allow y to vary

CS262 Lecture 9, Win07, Batzoglou Iterative Refinement Example: align (x,y), (z,w), (xy, zw): x:GAAGTTA y:GAC-TTA z:GAACTGA w:GTACTGA After realigning y: x:GAAGTTA y:G-ACTTA + 3 matches z:GAACTGA w:GTACTGA

CS262 Lecture 9, Win07, Batzoglou Iterative Refinement Example not handled well: x:GAAGTTA y 1 :GAC-TTA y 2 :GAC-TTA y 3 :GAC-TTA z:GAACTGA w:GTACTGA Realigning any single y i changes nothing

CS262 Lecture 9, Win07, Batzoglou Consistency z x y xixi yjyj y j’ zkzk

CS262 Lecture 9, Win07, Batzoglou Consistency Basic method for applying consistency Compute all pairs of alignments xy, xz, yz, … When aligning x, y during progressive alignment,  For each (x i, y j ), let s(x i, y j ) = function_of(x i, y j, a xz, a yz )  Align x and y with DP using the modified s(.,.) function z x y xixi yjyj y j’ zkzk

CS262 Lecture 9, Win07, Batzoglou Real-world protein aligners MUSCLE  High throughput  One of the best in accuracy ProbCons  High accuracy  Reasonable speed

CS262 Lecture 9, Win07, Batzoglou MUSCLE at a glance 1.Fast measurement of all pairwise distances between sequences D DRAFT (x, y) defined in terms of # common k-mers (k~3) – O(N 2 L logL) time 2.Build tree T DRAFT based on those distances, with UPGMA 3.Progressive alignment over T DRAFT, resulting in multiple alignment M DRAFT Only perform alignment steps for the parts of the tree that have changed 4.Measure new Kimura-based distances D(x, y) based on M DRAFT 5.Build tree T based on D 6.Progressive alignment over T, to build M 7.Iterative refinement; for many rounds, do: Tree Partitioning: Split M on one branch and realign the two resulting profiles If new alignment M’ has better sum-of-pairs score than previous one, accept

CS262 Lecture 9, Win07, Batzoglou PROBCONS at a glance 1.Computation of all posterior matrices M xy : M xy (i, j) = Prob(x i ~ y j ), using a HMM 2.Re-estimation of posterior matrices M’ xy with probabilistic consistency M’ xy (i, j) = 1/N  sequence z  k M xz (i, k)  M yz (j, k);M’ xy = Avg z (M xz M zy ) 3.Compute for every pair x, y, the maximum expected accuracy alignment A xy : alignment that maximizes  aligned (i, j) in A M’ xy (i, j) Define E(x, y) =  aligned (i, j) in Axy M’ xy (i, j) 4.Build tree T with hierarchical clustering using similarity measure E(x, y) 5.Progressive alignment on T to maximize E(.,.) 6.Iterative refinement; for many rounds, do: Randomized Partitioning: Split sequences in M in two subsets by flipping a coin for each sequence and realign the two resulting profiles

CS262 Lecture 9, Win07, Batzoglou Some Resources Genome Resources Annotation and alignment genome browser at UCSC http://genome.ucsc.edu/cgi-bin/hgGateway Specialized VISTA alignment browser at LBNL http://pipeline.lbl.gov/cgi-bin/gateway2 ABC—Nice Stanford tool for browsing alignments http://encode.stanford.edu/~asimenos/ABC/ Protein Multiple Aligners http://www.ebi.ac.uk/clustalw/ CLUSTALW – most widely used http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py MUSCLE – most scalable http://probcons.stanford.edu/ PROBCONS – most accurate

CS262 Lecture 9, Win07, Batzoglou Fragment Assembly Given N reads… Where N ~ 30 million… We need to use a linear-time algorithm.

Similar presentations

Presentation on theme: "CS262 Lecture 9, Win07, Batzoglou Fragment Assembly Given N reads… Where N ~ 30 million… We need to use a linear-time algorithm."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

CS262 Lecture 9, Win07, Batzoglou Fragment Assembly Given N reads… Where N ~ 30 million… We need to use a linear-time algorithm.

Similar presentations

Presentation on theme: "CS262 Lecture 9, Win07, Batzoglou Fragment Assembly Given N reads… Where N ~ 30 million… We need to use a linear-time algorithm."— Presentation transcript:

Similar presentations

About project

Feedback