CS262 Lecture 12, Win07, Batzoglou Some new sequencing technologies.

CS262 Lecture 12, Win07, Batzoglou Some new sequencing technologies

CS262 Lecture 12, Win07, Batzoglou Molecular Inversion Probes

CS262 Lecture 12, Win07, Batzoglou Single Molecule Array for Genotyping—Solexa

CS262 Lecture 12, Win07, Batzoglou Nanopore Sequencing http://www.mcb.harvard.edu/branton/index.htm

CS262 Lecture 12, Win07, Batzoglou Pyrosequencing on a chip Mostafa Ronaghi, Stanford Genome Technologies Center 454 Life Sciences

CS262 Lecture 12, Win07, Batzoglou Polony Sequencing

CS262 Lecture 12, Win07, Batzoglou Some future directions for sequencing 1.Personalized genome sequencing Find your ~3,000,000 single nucleotide polymorphisms (SNPs) Find your rearrangements Goals: Link genome with phenotype Provide personalized diet and medicine (???) designer babies, big-brother insurance companies Timeline: Inexpensive sequencing:2010-2015 Genotype–phenotype association:2010-??? Personalized drugs:2015-???

CS262 Lecture 12, Win07, Batzoglou Some future directions for sequencing 2.Environmental sequencing Find your flora: organisms living in your body External organs: skin, mucous membranes Gut, mouth, etc. Normal flora: >200 species, >trillions of individuals Flora–disease, flora–non-optimal health associations Timeline: Inexpensive research sequencing:today Research & associationswithin next 10 years Personalized sequencing2015+ Find diversity of organisms living in different environments Hard to isolate Assembly of all organisms at once

CS262 Lecture 12, Win07, Batzoglou Some future directions for sequencing 3.Organism sequencing Sequence a large fraction of all organisms Deduce ancestors Reconstruct ancestral genomes Synthesize ancestral genomes Clone—Jurassic park! Study evolution of function Find functional elements within a genome How those evolved in different organisms Find how modules/machines composed of many genes evolved

Phylogeny Tree Reconstruction 1 4 3 2 5 1 4 2 3 5

CS262 Lecture 12, Win07, Batzoglou Phylogenetic Trees Nodes: species Edges: time of independent evolution Edge length represents evolution time  AKA genetic distance  Not necessarily chronological time

CS262 Lecture 12, Win07, Batzoglou Inferring Phylogenetic Trees Trees can be inferred by several criteria:  Morphology of the organisms Can lead to mistakes!  Sequence comparison Example: Orc: ACAGTGACGCCCCAAACGT Elf: ACAGTGACGCTACAAACGT Dwarf: CCTGTGACGTAACAAACGA Hobbit: CCTGTGACGTAGCAAACGA Human: CCTGTGACGTAGCAAACGA

CS262 Lecture 12, Win07, Batzoglou Modeling Evolution During infinitesimal time  t, there is not enough time for two substitutions to happen on the same nucleotide So we can estimate P(x | y,  t), for x, y  {A, C, G, T} Then let P(A|A,  t) …… P(A|T,  t) S(  t) = ……… P(T|A,  t) ……P(T|T,  t) xx y tt

CS262 Lecture 12, Win07, Batzoglou Modeling Evolution Reasonable assumption: multiplicative (implying a stationary Markov process) S(t+t’) = S(t)S(t’) That is to say, P(x | y, t+t’) =  z P(x | z, t) P(z | y, t’) Jukes-Cantor: constant rate of evolution 1 - 3     For short time , S(  ) = I+R  =  1 - 3      1 - 3      1 - 3  AC GT

CS262 Lecture 12, Win07, Batzoglou Modeling Evolution Jukes-Cantor: For longer times, r(t)s(t) s(t) s(t) S(t) = s(t)r(t) s(t) s(t) s(t)s(t) r(t) s(t) s(t)s(t) s(t) r(t) Where we can derive: r(t) = ¼ (1 + 3 e -4  t ) s(t) = ¼ (1 – e -4  t ) S(t+  ) = S(t)S(  ) = S(t)(I + R  ) Therefore, (S(t+  ) – S(t))/  = S(t) R At the limit of   0, S’(t) = S(t) R Equivalently, r’ = -3  r + 3  s s’ = -  s +  r Those diff. equations lead to: r(t) = ¼ (1 + 3 e -4  t ) s(t) = ¼ (1 – e -4  t )

CS262 Lecture 12, Win07, Batzoglou Modeling Evolution Kimura: Transitions: A/G, C/T Transversions: A/T, A/C, G/T, C/G Transitions (rate  ) are much more likely than transversions (rate  ) r(t)s(t) u(t) s(t) S(t) = s(t)r(t) s(t) u(t) u(t)s(t) r(t) s(t) s(t)u(t) s(t) r(t) Wheres(t) = ¼ (1 – e -4  t ) u(t) = ¼ (1 + e -4  t – e -2(  +  )t ) r(t) = 1 – 2s(t) – u(t)

CS262 Lecture 12, Win07, Batzoglou Phylogeny and sequence comparison Basic principles: Degree of sequence difference is proportional to length of independent sequence evolution Only use positions where alignment is pretty certain – avoid areas with (too many) gaps

CS262 Lecture 12, Win07, Batzoglou Distance between two sequences Given sequences x i, x j, Define d ij = distance between the two sequences One possible definition: d ij = fraction f of sites u where x i [u]  x j [u] Better model (Jukes-Cantor): f = 3 s(t) = ¾ (1 – e -4  t )  ¾ e -4  t = ¾ – f  log (e -4  t ) = log (1 – 4/3 f)  -4  t = log(1 – 4/3 f) d ij = t = - ¼  -1 log(1 – 4/3 f)

CS262 Lecture 12, Win07, Batzoglou A simple clustering method for building tree UPGMA (unweighted pair group method using arithmetic averages) Or the Average Linkage Method Given two disjoint clusters C i, C j of sequences, 1 d ij = –––––––––  {p  Ci, q  Cj} d pq |C i |  |C j | Claim that if C k = C i  C j, then distance to another cluster C l is: d il |C i | + d jl |C j | d kl = –––––––––––––– |C i | + |C j | Proof  Ci,Cl d pq +  Cj,Cl d pq d kl = –––––––––––––––– (|C i | + |C j |) |C l | |C i |/(|C i ||C l |)  Ci,Cl d pq + |C j |/(|C j ||C l |)  Cj,Cl d pq = –––––––––––––––––––––––––––––––––––– (|C i | + |C j |) |C i | d il + |C j | d jl = ––––––––––––– (|C i | + |C j |)

CS262 Lecture 12, Win07, Batzoglou Algorithm: Average Linkage Initialization: Assign each x i into its own cluster C i Define one leaf per sequence, height 0 Iteration: Find two clusters C i, C j s.t. d ij is min Let C k = C i  C j Define node connecting C i, C j, & place it at height d ij /2 Delete C i, C j Termination: When two clusters i, j remain, place root at height d ij /2 1 4 3 2 5 1 4 2 3 5

CS262 Lecture 12, Win07, Batzoglou Example vwxyz v 06888 w 0888 x 044 y 02 z 0 yzxwv 1 2 3 4 vwxyz v 0688 w 088 x 04 0 vwxyz v 068 w 08 0 vwxyz vw 08 xyz 0

CS262 Lecture 12, Win07, Batzoglou Ultrametric Distances and Molecular Clock Definition: A distance function d(.,.) is ultrametric if for any three distances d ij  d ik  d ij, it is true that d ij  d ik = d ij The Molecular Clock: The evolutionary distance between species x and y is 2  the Earth time to reach the nearest common ancestor That is, the molecular clock has constant rate in all species 14235 years The molecular clock results in ultrametric distances

CS262 Lecture 12, Win07, Batzoglou Ultrametric Distances & Average Linkage Average Linkage is guaranteed to reconstruct correctly a binary tree with ultrametric distances Proof: Exercise 1423 5

CS262 Lecture 12, Win07, Batzoglou Weakness of Average Linkage Molecular clock: all species evolve at the same rate (Earth time) However, certain species (e.g., mouse, rat) evolve much faster Example where UPGMA messes up: 2 3 4 1 1 4 3 2 Correct tree AL tree

CS262 Lecture 12, Win07, Batzoglou Additive Distances Given a tree, a distance measure is additive if the distance between any pair of leaves is the sum of lengths of edges connecting them Given a tree T & additive distances d ij, can uniquely reconstruct edge lengths: Find two neighboring leaves i, j, with common parent k Place parent node k at distance d km = ½ (d im + d jm – d ij ) from any node m  i, j 1 2 3 4 5 6 7 8 9 10 12 11 13 d 1,4

CS262 Lecture 12, Win07, Batzoglou Additive Distances For any four leaves x, y, z, w, consider the three sums d(x, y) + d(z, w) d(x, z) + d(y, w) d(x, w) + d(y, z) One of them is smaller than the other two, which are equal d(x, y) + d(z, w) < d(x, z) + d(y, w) = d(x, w) + d(y, z) x y z w

CS262 Lecture 12, Win07, Batzoglou Reconstructing Additive Distances Given T x y z w v 5 4 7 3 3 4 6 vwxyz v 0101716 w 01514 x 0915 y 014 z 0 T If we know T and D, but do not know the length of each leaf, we can reconstruct those lengths D

CS262 Lecture 12, Win07, Batzoglou Reconstructing Additive Distances Given T x y z w v vwxyz v 0101716 w 01514 x 0915 y 014 z 0 T D

CS262 Lecture 12, Win07, Batzoglou Reconstructing Additive Distances Given T x y z w v vwxyz v 0101716 w 01514 x 0915 y 014 z 0 T D axyz a 01110 x 0915 y 014 z 0 a D1D1 d ax = ½ (d vx + d wx – d vw ) d ay = ½ (d vy + d wy – d vw ) d az = ½ (d vz + d wz – d vw )

CS262 Lecture 12, Win07, Batzoglou Reconstructing Additive Distances Given T x y z w v T axyz a 01110 x 0915 y 014 z 0 a D1D1 abz a 0610 b 0 z 0 D2D2 b c ac a 03 c 0 D3D3 d(a, c) = 3 d(b, c) = d(a, b) – d(a, c) = 3 d(c, z) = d(a, z) – d(a, c) = 7 d(b, x) = d(a, x) – d(a, b) = 5 d(b, y) = d(a, y) – d(a, b) = 4 d(a, w) = d(z, w) – d(a, z) = 4 d(a, v) = d(z, v) – d(a, z) = 6 Correct!!! 5 4 7 3 3 4 6

CS262 Lecture 12, Win07, Batzoglou Neighbor-Joining Guaranteed to produce the correct tree if distance is additive May produce a good tree even when distance is not additive Step 1: Finding neighboring leaves Define D ij = (N – 2) d ij –  k  i d ik –  k  j d jk Claim: The above “magic trick” ensures that D ij is minimal iff i, j are neighbors 1 2 4 3 0.1 0.4

CS262 Lecture 12, Win07, Batzoglou Algorithm: Neighbor-joining Initialization: Define T to be the set of leaf nodes, one per sequence Let L = T Iteration: Pick i, j s.t. D ij is minimal Define a new node k, and set d km = ½ (d im + d jm – d ij ) for all m  L Add k to T, with edges of lengths d ik = ½ (d ij + r i – r j ), d jk = d ij – d ik Remove i, j from L; Add k to L Termination: When L consists of two nodes, i, j, and the edge between them of length d ij

CS262 Lecture 12, Win07, Batzoglou Parsimony – direct method not using distances One of the most popular methods:  GIVEN multiple alignment  FIND tree & history of substitutions explaining alignment Idea: Find the tree that explains the observed sequences with a minimal number of substitutions Two computational subproblems: 1.Find the parsimony cost of a given tree (easy) 2.Search through all tree topologies (hard)

CS262 Lecture 12, Win07, Batzoglou Example: Parsimony cost of one column A B A A {A, B} Cost C+=1 {A} Final cost C = 1 {A} {B} {A} ABAAABAA

CS262 Lecture 12, Win07, Batzoglou Parsimony Scoring Given a tree, and an alignment column u Label internal nodes to minimize the number of required substitutions Initialization: Set cost C = 0; node k = 2N – 1 (last leaf) Iteration: If k is a leaf, set R k = { x k [u] }// R k is simply the character of k th species If k is not a leaf, Let i, j be the daughter nodes; Set R k = R i  R j if intersection is nonempty Set R k = R i  R j, and C += 1, if intersection is empty Termination: Minimal cost of tree for column u, = C

CS262 Lecture 12, Win07, Batzoglou Example AAAB {A} {B} BABA {A}{B}{A}{B} {A} {A,B} {B}

CS262 Lecture 12, Win07, Batzoglou Traceback: 1.Choose an arbitrary nucleotide from R 2N – 1 for the root 2.Having chosen nucleotide r for parent k, If r  R i choose r for daughter i Else, choose arbitrary nucleotide from R i Easy to see that this traceback produces some assignment of cost C Traceback to find ancestral nucleotides

CS262 Lecture 12, Win07, Batzoglou Example A B A B {A, B} {A} {B} {A} {B} A B A B A A A x x A B A B A B A x x A B A B B B B x x Admissible with Traceback Still optimal, but inadmissible with Traceback

CS262 Lecture 12, Win07, Batzoglou Probabilistic Methods A more refined measure of evolution along a tree than parsimony P(x 1, x 2, x root | t 1, t 2 ) = P(x root ) P(x 1 | t 1, x root ) P(x 2 | t 2, x root ) If we use Jukes-Cantor, for example, and x 1 = x root = A, x 2 = C, t 1 = t 2 = 1, = p A  ¼(1 + 3e -4α )  ¼(1 – e -4α ) = (¼) 3 (1 + 3e -4α )(1 – e -4α ) x1x1 t2t2 x root t1t1 x2x2

CS262 Lecture 12, Win07, Batzoglou Probabilistic Methods If we know all internal labels x u, P(x 1, x 2, …, x N, x N+1, …, x 2N-1 | T, t) = P(x root )  j  root P(x j | x parent(j), t j, parent(j) ) Usually we don’t know the internal labels, therefore P(x 1, x 2, …, x N | T, t) =  x N+1  x N+2 …  x 2N-1 P(x 1, x 2, …, x 2N-1 | T, t) x root x1x1 x2x2 xNxN xuxu

CS262 Lecture 12, Win07, Batzoglou Probabilistic Methods Given M (ungapped) alignment columns of N sequences, Define likelihood of a tree: L(T, t) = P(Data | T, t) =  m=1…M P(x 1m, …, x nm, T, t) Maximum Likelihood Reconstruction: Given data X = (x ij ), find a topology T and length vector t that maximize likelihood L(T, t)

CS262 Lecture 12, Win07, Batzoglou Current popular methods HUNDREDS of programs available! http://evolution.genetics.washington.edu/phylip/software.html#methods Some recommended programs: Discrete—Parsimony-based  Rec-1-DCM3 http://www.cs.utexas.edu/users/tandy/mp.html Tandy Warnow and colleagues Probabilistic  SEMPHY http://www.cs.huji.ac.il/labs/compbio/semphy/ Nir Friedman and colleagues

CS262 Lecture 12, Win07, Batzoglou Multiple Sequence Alignments

CS262 Lecture 12, Win07, Batzoglou Protein Phylogenies Proteins evolve by both duplication and species divergence

CS262 Lecture 12, Win07, Batzoglou Protein Phylogenies – Example

CS262 Lecture 12, Win07, Batzoglou

Definition Given N sequences x 1, x 2,…, x N :  Insert gaps (-) in each sequence x i, such that All sequences have the same length L Score of the global map is maximum A faint similarity between two sequences becomes significant if present in many Multiple alignments can point to elements that are conserved among a class of and therefore important in the biology of these organisms The patterns of conservation can help us tell function of the element

CS262 Lecture 12, Win07, Batzoglou Scoring Function: Sum Of Pairs Definition: Induced pairwise alignment A pairwise alignment induced by the multiple alignment Example: x:AC-GCGG-C y:AC-GC-GAG z:GCCGC-GAG Induces: x: ACGCGG-C; x: AC-GCGG-C; y: AC-GCGAG y: ACGC-GAC; z: GCCGC-GAG; z: GCCGCGAG

CS262 Lecture 12, Win07, Batzoglou Sum Of Pairs (cont’d) Heuristic way to incorporate evolution tree: Human Mouse Chicken Weighted SOP: S(m) =  k<l w kl s(m k, m l ) Duck

CS262 Lecture 12, Win07, Batzoglou A Profile Representation Given a multiple alignment M = m 1 …m n  Replace each column m i with profile entry p i Frequency of each letter in  # gaps Optional: # gap openings, extensions, closings  Can think of this as a “likelihood” of each letter in each position - A G G C T A T C A C C T G T A G – C T A C C A - - - G C A G – C T A C C A - - - G C A G – C T A T C A C – G G C A G – C T A T C G C – G G A 1 1.8 C.6 1.4 1.6.2 G 1.2.2.4 1 T.2 1.6.2 -.2.8.4.8.4

CS262 Lecture 12, Win07, Batzoglou Multiple Sequence Alignments Algorithms

CS262 Lecture 12, Win07, Batzoglou Multidimensional DP Generalization of Needleman-Wunsh: S(m) =  i S(m i ) (sum of column scores) F(i 1,i 2,…,i N ): Optimal alignment up to (i 1, …, i N ) F(i 1,i 2,…,i N )= max (all neighbors of cube) (F(nbr)+S(nbr))

CS262 Lecture 12, Win07, Batzoglou Example: in 3D (three sequences): 7 neighbors/cell F(i,j,k) = max{ F(i – 1, j – 1, k – 1) + S(x i, x j, x k ), F(i – 1, j – 1, k ) + S(x i, x j, - ), F(i – 1, j, k – 1) + S(x i, -, x k ), F(i – 1, j, k ) + S(x i, -, - ), F(i, j – 1, k – 1) + S( -, x j, x k ), F(i, j – 1, k ) + S( -, x j, - ), F(i, j, k – 1) + S( -, -, x k ) } Multidimensional DP

CS262 Lecture 12, Win07, Batzoglou Running Time: 1.Size of matrix:L N ; Where L = length of each sequence N = number of sequences 2.Neighbors/cell: 2 N – 1 Therefore………………………… O(2 N L N ) Multidimensional DP

CS262 Lecture 12, Win07, Batzoglou Running Time: 1.Size of matrix:L N ; Where L = length of each sequence N = number of sequences 2.Neighbors/cell: 2 N – 1 Therefore………………………… O(2 N L N ) Multidimensional DP How do gap states generalize? VERY badly!  Require 2 N – 1 states, one per combination of gapped/ungapped sequences  Running time: O(2 N  2 N  L N ) = O(4 N L N ) XYXYZZ YYZ XXZ

CS262 Lecture 12, Win07, Batzoglou Progressive Alignment When evolutionary tree is known:  Align closest first, in the order of the tree  In each step, align two sequences x, y, or profiles p x, p y, to generate a new alignment with associated profile p result Weighted version:  Tree edges have weights, proportional to the divergence in that edge  New profile is a weighted average of two old profiles x w y z p xy p zw p xyzw

CS262 Lecture 12, Win07, Batzoglou Progressive Alignment When evolutionary tree is known:  Align closest first, in the order of the tree  In each step, align two sequences x, y, or profiles p x, p y, to generate a new alignment with associated profile p result Weighted version:  Tree edges have weights, proportional to the divergence in that edge  New profile is a weighted average of two old profiles x w y z Example Profile: (A, C, G, T, -) p x = (0.8, 0.2, 0, 0, 0) p y = (0.6, 0, 0, 0, 0.4) s(p x, p y ) = 0.8*0.6*s(A, A) + 0.2*0.6*s(C, A) + 0.8*0.4*s(A, -) + 0.2*0.4*s(C, -) Result: p xy = (0.7, 0.1, 0, 0, 0.2) s(p x, -) = 0.8*1.0*s(A, -) + 0.2*1.0*s(C, -) Result: p x- = (0.4, 0.1, 0, 0, 0.5)

CS262 Lecture 12, Win07, Batzoglou Progressive Alignment When evolutionary tree is unknown:  Perform all pairwise alignments  Define distance matrix D, where D(x, y) is a measure of evolutionary distance, based on pairwise alignment  Construct a tree (UPGMA / Neighbor Joining / Other methods)  Align on the tree x w y z ?

CS262 Lecture 12, Win07, Batzoglou Heuristics to improve alignments Iterative refinement schemes A*-based search Consistency Simulated Annealing …

CS262 Lecture 12, Win07, Batzoglou Iterative Refinement One problem of progressive alignment: Initial alignments are “frozen” even when new evidence comes Example: x:GAAGTT y:GAC-TT z:GAACTG w:GTACTG Frozen! Now clear correct y = GA-CTT

CS262 Lecture 12, Win07, Batzoglou Iterative Refinement Algorithm (Barton-Stenberg): 1.For j = 1 to N, Remove x j, and realign to x 1 …x j-1 x j+1 …x N 2.Repeat 4 until convergence x y z x,z fixed projection allow y to vary

CS262 Lecture 12, Win07, Batzoglou Iterative Refinement Example: align (x,y), (z,w), (xy, zw): x:GAAGTTA y:GAC-TTA z:GAACTGA w:GTACTGA After realigning y: x:GAAGTTA y:G-ACTTA + 3 matches z:GAACTGA w:GTACTGA

CS262 Lecture 12, Win07, Batzoglou Iterative Refinement Example not handled well: x:GAAGTTA y 1 :GAC-TTA y 2 :GAC-TTA y 3 :GAC-TTA z:GAACTGA w:GTACTGA Realigning any single y i changes nothing

CS262 Lecture 12, Win07, Batzoglou A* for Multiple Alignments Review of the A* algorithm v START GOAL Say that we have a gigantic graph G START: start node GOAL: we want to reach this node with the minimum path Dijkstra: O(VlogV + E) – too slow if the number of edges is huge A*: a way of finding the optimal solution faster in practice

CS262 Lecture 12, Win07, Batzoglou A* for Multiple Alignments Review of the A* algorithm v START GOAL g(v) h(v) g(v) is the cost so far h(v) is an estimate of the minimum cost from v to GOAL f(v) ≥ g(v) + h(v) is the minimum cost of a path passing by v 1. Expand v with the smallest f(v) 2. Never expand v, if f(v) ≥ shortest path to the goal found so far Lemma Given sequences x, y, z, … The sum-of pairs score of multiple alignment M is lower (worse) than the sum of the optimal pairwise alignments Proof M induces projected pairwise alignments a xy, a yz, a xz, …, and Score(M) = d(a xy ) + d(a xz ) + d(a yz ) +… Each of d(.) is smaller than the optimal edit distance

CS262 Lecture 12, Win07, Batzoglou A* for Multiple Alignments Nodes: Cells in the DP matrix g(v): alignment cost so far h(v): sum-of-pairs of individual pairwise alignments Initial minimum alignment cost estimate: sum-of-pairs of global pairwise alignments v START GOAL g(v) h(v) To compute h(v) For each pair of sequences x, y, Compute F R (x, y), the DP matrix of scores of aligning a suffix of x to a suffix of y Then, at position (i 1, i 2, …, i N ), h(v) becomes the sum of (N choose 2) F R scores

CS262 Lecture 12, Win07, Batzoglou Consistency z x y xixi yjyj y j’ zkzk

CS262 Lecture 12, Win07, Batzoglou Consistency Basic method for applying consistency Compute all pairs of alignments xy, xz, yz, … When aligning x, y during progressive alignment,  For each (x i, y j ), let s(x i, y j ) = function_of(x i, y j, a xz, a yz )  Align x and y with DP using the modified s(.,.) function z x y xixi yjyj y j’ zkzk

CS262 Lecture 12, Win07, Batzoglou Some Resources Genome Resources Annotation and alignment genome browser at UCSC http://genome.ucsc.edu/cgi-bin/hgGateway Specialized VISTA alignment browser at LBNL http://pipeline.lbl.gov/cgi-bin/gateway2 ABC—Nice Stanford tool for browsing alignments http://encode.stanford.edu/~asimenos/ABC/ Protein Multiple Aligners http://www.ebi.ac.uk/clustalw/ CLUSTALW – most widely used http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py MUSCLE – most scalable http://probcons.stanford.edu/ PROBCONS – most accurate

CS262 Lecture 12, Win07, Batzoglou Some new sequencing technologies.

Similar presentations

Presentation on theme: "CS262 Lecture 12, Win07, Batzoglou Some new sequencing technologies."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

CS262 Lecture 12, Win07, Batzoglou Some new sequencing technologies.

Similar presentations

Presentation on theme: "CS262 Lecture 12, Win07, Batzoglou Some new sequencing technologies."— Presentation transcript:

Similar presentations

About project

Feedback