DNA Sequencing.


DNA Sequencing

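Whole Genome Shotgun Sequencing
Cut the genome many times at random and clone the fragments into plasmids (2 – 10 Kbp) or cosmids (40 Kbp). Sequence ~500 bp from each end of an insert to obtain forward-reverse paired reads separated by a known distance.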

Steps to Assemble a Genome
1. Find overlapping reads
2. Merge some “good” pairs of reads into longer contigs
3. Link contigs to form supercontigs
4. Derive consensus sequence ..ACGATTACAATAGGTT..

Some Terminology
read: a 500-900 long word that comes out of the sequencer
mate pair: a pair of reads from two ends of the same insert fragment
contig: a contiguous sequence formed by several overlapping reads with no gaps
supercontig (scaffold): an ordered and oriented set of contigs, usually linked by mate pairs
consensus sequence: the sequence derived from the multiple alignment of reads in a contig

2. Merge Reads into Contigs
Overlap graph:
Nodes: reads r1…..rn
Edges: overlaps (ri, rj, shift, orientation, score)
(Figure: reads that come from two regions of the genome, blue and red, that contain the same repeat.)
Note: of course, we don’t know the “color” of these nodes

2. Merge Reads into Contigs
We want to merge reads up to potential repeat boundaries
(Figure labels: repeat region, unique contig, overcollapsed contig.)

2. Merge Reads into Contigs
Ignore non-maximal reads
Merge only maximal reads into contigs
(Figure label: repeat region.)

2. Merge Reads into Contigs
Remove transitively inferable overlaps
If read r overlaps reads r1 and r2 to its right, and r1 overlaps r2, then the overlap (r, r2) can be inferred from (r, r1) and (r1, r2)
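The rule above can be sketched in a few lines. This is only an illustration, assuming overlaps are given as directed pairs (a, b) meaning "read a overlaps read b to its right"; the function name and the data layout are hypothetical, and shift, orientation and score are ignored.

```python
def remove_transitive_overlaps(overlaps):
    """Drop every overlap (a, b) for which some r1 has both (a, r1) and (r1, b).

    overlaps: set of (a, b) pairs, read a overlapping read b to its right.
    """
    successors = {}
    for a, b in overlaps:
        successors.setdefault(a, set()).add(b)

    reduced = set()
    for a, b in overlaps:
        # (a, b) is inferable if another right-neighbor of a also overlaps b
        inferable = any(
            b in successors.get(mid, ()) for mid in successors[a] if mid != b
        )
        if not inferable:
            reduced.add((a, b))
    return reduced


edges = {("r", "r1"), ("r", "r2"), ("r1", "r2")}
reduced_edges = remove_transitive_overlaps(edges)
```

On the three-read example from the slide, only (r, r1) and (r1, r2) survive.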

2. Merge Reads into Contigs

Overlap graph after forming contigs Unitigs: Gene Myers, 95

Repeats, errors, and contig lengths
Repeats shorter than read length are easily resolved
  A read that spans across a repeat disambiguates the order of the flanking regions
Repeats with more base pair diffs than the sequencing error rate are OK
  We throw out overlaps between two reads in different copies of the repeat
To make the genome appear less repetitive, try to:
  Increase read length
  Decrease sequencing error rate
Role of error correction:
  Discards up to 98% of single-letter sequencing errors
  decreases error rate → decreases effective repeat content → increases contig length

3. Link Contigs into Supercontigs
Normal density
Too dense → Overcollapsed
Inconsistent links → Overcollapsed?

3. Link Contigs into Supercontigs
Find all links between unique contigs
Connect contigs incrementally, if ≥ 2 links
The result is a supercontig (aka scaffold)

3. Link Contigs into Supercontigs Fill gaps in supercontigs with paths of repeat contigs

4. Derive Consensus Sequence
TAGATTACACAGATTACTGA TTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG TTACACAGATTATTGACTTCATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
Derive multiple alignment from pairwise read alignments
Derive each consensus base by weighted voting
(Alternative: take maximum-quality letter)
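A minimal sketch of column-wise weighted voting, assuming the reads are already laid out in a multiple alignment with "-" for a gap and a space where a read does not cover a column. The function name, the toy reads, and the per-base weights (standing in for quality values) are all hypothetical.

```python
from collections import defaultdict


def consensus(aligned_reads, weights=None):
    """Pick, per column, the symbol with the largest total weight.

    aligned_reads: equal-length strings; '-' is a gap, ' ' means "not covered".
    weights: optional per-read, per-column weights; defaults to 1 vote each.
    """
    n_cols = len(aligned_reads[0])
    result = []
    for col in range(n_cols):
        votes = defaultdict(float)
        for r, read in enumerate(aligned_reads):
            symbol = read[col]
            if symbol == " ":
                continue  # this read does not cover the column
            votes[symbol] += weights[r][col] if weights else 1.0
        if not votes:
            continue
        # sorted() makes tie-breaking deterministic (alphabetical)
        winner = max(sorted(votes), key=votes.get)
        if winner != "-":
            result.append(winner)
    return "".join(result)


reads = [
    "TAGATTACA",
    "TAG-TTACA",
    "TAGATTGCA",
    "  GATTACA",
]
print(consensus(reads))
```

With unit weights the majority wins in every column; passing larger weights for high-quality bases lets a single confident read outvote several weak ones.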

Some Assemblers
PHRAP
  Early assembler, widely used, good model of read errors
  Overlap O(n^2) → layout (no mate pairs) → consensus
Celera
  First assembler to handle large genomes (fly, human, mouse)
  Overlap → layout → consensus
Arachne
  Public assembler (mouse, several fungi)
Phusion
  Overlap → clustering → PHRAP → assemblage → consensus
Euler
  Indexing → Euler graph → layout by picking paths → consensus

Quality of assemblies: mouse
Terminology: N50 contig length
If we sort contigs from largest to smallest, and start covering the genome in that order, N50 is the length of the contig that just covers the 50th percentile.
7.7X sequence coverage
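The N50 definition translates directly into code. In this sketch the total assembled length is used as a stand-in for the genome size, which is an assumption; the function name and the toy contig lengths are made up.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total, taken from the
    largest contig downward, first reaches half of the total length."""
    total = sum(contig_lengths)
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if 2 * covered >= total:
            return length
    raise ValueError("no contigs given")


print(n50([100, 80, 60, 40, 20]))
```

Here the total is 300, the first contig covers 100 and the first two cover 180, so the second contig (length 80) is the one that crosses the halfway mark.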

Quality of assemblies: dog
7.5X sequence coverage

History of WGA
1982: λ-virus, 48,502 bp
1995: h-influenzae, 1 Mbp
1997: “Let’s sequence the human genome with the shotgun strategy” (Gene Myers); “That is impossible, and a bad idea anyway” (Phil Green)
2000: fly, 100 Mbp
2001 – present: human (3Gbp), mouse (2.5Gbp), rat*, chicken, dog, chimpanzee, several fungal genomes

Genomes Sequenced http://www.genome.gov/10002154

Phylogeny Tree Reconstruction
(Figure: a tree over leaves 1 to 5.)

Phylogenetic Trees Nodes: species Edges: time of independent evolution Edge length represents evolution time AKA genetic distance Not necessarily chronological time

Inferring Phylogenetic Trees
Trees can be inferred by several criteria:
Morphology of the organisms
  Can lead to mistakes!
Sequence comparison
Example:
Orc:    ACAGTGACGCCCCAAACGT
Elf:    ACAGTGACGCTACAAACGT
Dwarf:  CCTGTGACGTAACAAACGA
Hobbit: CCTGTGACGTAGCAAACGA
Human:  CCTGTGACGTAGCAAACGA

Phylogeny and sequence comparison Basic principles: Degree of sequence difference is proportional to length of independent sequence evolution Only use positions where alignment is certain – avoid areas with (too many) gaps

Distance between two sequences
Given sequences xi, xj,
Define dij = distance between the two sequences
One possible definition: dij = fraction f of sites u where xi[u] ≠ xj[u]
Better scores are derived by modeling evolution as a continuous change process (not covered here): Jukes-Cantor, Kimura, etc.
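As an illustration, the mismatch-fraction definition can be applied to the Orc/Elf/Dwarf/Hobbit/Human example sequences. The function names are hypothetical. The Jukes-Cantor correction is included only as an aside, using its standard closed form d = -3/4 ln(1 - 4p/3); the slides do not derive it.

```python
import math


def p_distance(x, y):
    """Fraction of aligned sites u where x[u] != y[u]."""
    if len(x) != len(y):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(x, y)) / len(x)


def jukes_cantor_distance(p):
    """Jukes-Cantor corrected distance for a mismatch fraction p < 3/4."""
    return -0.75 * math.log(1 - 4 * p / 3)


sequences = {
    "Orc": "ACAGTGACGCCCCAAACGT",
    "Elf": "ACAGTGACGCTACAAACGT",
    "Dwarf": "CCTGTGACGTAACAAACGA",
    "Hobbit": "CCTGTGACGTAGCAAACGA",
    "Human": "CCTGTGACGTAGCAAACGA",
}

for a in sequences:
    row = [f"{p_distance(sequences[a], sequences[b]):.3f}" for b in sequences]
    print(f"{a:7s}", " ".join(row))
```

Orc and Elf differ at 2 of 19 sites, while Hobbit and Human are identical, so their distance is 0.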

A simple clustering method for building tree
UPGMA (unweighted pair group method using arithmetic averages), or the Average Linkage Method
Given two disjoint clusters Ci, Cj of sequences,
$$d_{ij} = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i,\; q \in C_j} d_{pq}$$
Claim that if Ck = Ci ∪ Cj, then distance to another cluster Cl is:
$$d_{kl} = \frac{d_{il}\,|C_i| + d_{jl}\,|C_j|}{|C_i| + |C_j|}$$
Proof:
$$d_{kl} = \frac{\sum_{C_i, C_l} d_{pq} + \sum_{C_j, C_l} d_{pq}}{(|C_i| + |C_j|)\,|C_l|} = \frac{\frac{|C_i|}{|C_i|\,|C_l|} \sum_{C_i, C_l} d_{pq} + \frac{|C_j|}{|C_j|\,|C_l|} \sum_{C_j, C_l} d_{pq}}{|C_i| + |C_j|} = \frac{|C_i|\, d_{il} + |C_j|\, d_{jl}}{|C_i| + |C_j|}$$

Algorithm: Average Linkage
Initialization:
  Assign each xi into its own cluster Ci
  Define one leaf per sequence, height 0
Iteration:
  Find two clusters Ci, Cj s.t. dij is min
  Let Ck = Ci ∪ Cj
  Define node connecting Ci, Cj, & place it at height dij/2
  Delete Ci, Cj
Termination:
  When two clusters i, j remain, place root at height dij/2
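A compact sketch of the average linkage procedure, written under a few assumptions: distances arrive as a dict keyed by leaf pairs, merged clusters are represented as nested tuples, and the distance to a merged cluster uses the size-weighted average of the two old distances. The function name and the five-leaf matrix are hypothetical.

```python
from itertools import combinations


def upgma(names, dist):
    """names: leaf labels; dist: {(a, b): distance} for every leaf pair.

    Returns (root, height), where root is a nested tuple and height maps
    every node (leaf or tuple) to its height in the tree.
    """
    size = {n: 1 for n in names}
    height = {n: 0.0 for n in names}
    d = {frozenset(pair): value for pair, value in dist.items()}
    while len(size) > 1:
        # closest pair of current clusters
        a, b = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        new = (a, b)
        height[new] = d[frozenset((a, b))] / 2
        for c in size:
            if c not in (a, b):
                # size-weighted average of the two old distances
                d[frozenset((new, c))] = (
                    d[frozenset((a, c))] * size[a] + d[frozenset((b, c))] * size[b]
                ) / (size[a] + size[b])
        size[new] = size[a] + size[b]
        del size[a], size[b]
    (root,) = size
    return root, height


# Hypothetical ultrametric distances on five leaves
pairs = {
    ("v", "w"): 6, ("v", "x"): 8, ("v", "y"): 8, ("v", "z"): 8,
    ("w", "x"): 8, ("w", "y"): 8, ("w", "z"): 8,
    ("x", "y"): 4, ("x", "z"): 4, ("y", "z"): 2,
}
root, height = upgma(["v", "w", "x", "y", "z"], pairs)
print(root, height[root])
```

For this input the merges happen at heights 1, 2, 3 and 4: first y with z, then x with yz, then v with w, and finally the two remaining clusters at the root.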

Example
(Figure: average linkage on five leaves v, w, x, y, z. The distance matrix shrinks step by step over the clusters {v, w, x, yz}, {v, w, xyz}, {vw, xyz}, and the tree is built with internal nodes at heights 1, 2, 3, 4.)

Ultrametric Distances and Molecular Clock
Definition: A distance function d(.,.) is ultrametric if for any three distances dij ≤ dik ≤ djk, it is true that dij ≤ dik = djk
The Molecular Clock: The evolutionary distance between species x and y is 2 × the Earth time to reach the nearest common ancestor
That is, the molecular clock has constant rate in all species
The molecular clock results in ultrametric distances
(Figure: a tree over leaves 1 to 5 drawn against a time axis in years.)

Ultrametric Distances & Average Linkage
Average Linkage is guaranteed to reconstruct correctly a binary tree with ultrametric distances
Proof: Exercise

Weakness of Average Linkage
Molecular clock: all species evolve at the same rate (Earth time)
However, certain species (e.g., mouse, rat) evolve much faster
Example where UPGMA messes up:
(Figure: the correct tree next to the tree produced by average linkage, over leaves 1 to 4.)

Additive Distances
Given a tree, a distance measure is additive if the distance between any pair of leaves is the sum of lengths of edges connecting them
Given a tree T & additive distances dij, can uniquely reconstruct edge lengths:
  Find two neighboring leaves i, j, with common parent k
  Place parent node k at distance dkm = ½ (dim + djm – dij) from any node m ≠ i, j
(Figure: a tree with numbered nodes, with the path for d1,4 highlighted.)

Additive Distances
For any four leaves x, y, z, w, consider the three sums
d(x, y) + d(z, w)
d(x, z) + d(y, w)
d(x, w) + d(y, z)
One of them is smaller than the other two, which are equal. For the tree in which x, y are neighbors and z, w are neighbors:
d(x, y) + d(z, w) < d(x, z) + d(y, w) = d(x, w) + d(y, z)
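The four-leaf test above can be turned into a simple additivity check. This is a sketch under assumptions: the function names and the toy distances are hypothetical, and the check only requires that the two largest of the three sums agree, which also accepts the degenerate case where all three are equal.

```python
from itertools import combinations


def symmetric(pairs):
    """Expand {(a, b): value} so that both (a, b) and (b, a) are keys."""
    d = {}
    for (a, b), value in pairs.items():
        d[a, b] = d[b, a] = value
    return d


def satisfies_four_point(leaves, d, tol=1e-9):
    """True if, for every four leaves, the two largest pair sums are equal."""
    for x, y, z, w in combinations(leaves, 4):
        sums = sorted(
            [d[x, y] + d[z, w], d[x, z] + d[y, w], d[x, w] + d[y, z]]
        )
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


# Distances read off a tree: leaf edges x=1, y=2, z=4, w=5, internal edge 3
additive = symmetric({
    ("x", "y"): 3, ("z", "w"): 9, ("x", "z"): 8,
    ("x", "w"): 9, ("y", "z"): 9, ("y", "w"): 10,
})
perturbed = dict(additive)
perturbed["y", "w"] = perturbed["w", "y"] = 12

print(satisfies_four_point("xyzw", additive))   # sums are 12, 18, 18
print(satisfies_four_point("xyzw", perturbed))  # sums are 12, 18, 20
```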

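Reconstructing Additive Distances Given T
If we know T and D, but do not know the length of each edge, we can reconstruct those lengths
(Figure: tree T on leaves v, w, x, y, z with edge lengths x: 5, y: 4, z: 7, w: 4, v: 6 and two internal edges of length 3.)
D    w   x   y   z
v   10  17  16  16
w       15  14  14
x            9  15
y               14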

Reconstructing Additive Distances Given T
(Figure: the same tree T on leaves v, w, x, y, z with its edge lengths unknown, next to the distance matrix D.)

Reconstructing Additive Distances Given T
Leaves v and w are neighbors with common parent a:
dax = ½ (dvx + dwx – dvw)
day = ½ (dvy + dwy – dvw)
daz = ½ (dvz + dwz – dvw)
D1   x   y   z
a   11  10  10
x        9  15
y           14
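The step that replaces two neighboring leaves by their parent can be sketched as follows. The function names are hypothetical, and the entries of D used here (dvw = 10, dvx = 17, dwx = 15, dvy = dvz = 16, dwy = dwz = 14, dxy = 9, dxz = 15, dyz = 14) are reconstructed from the example tree, so treat them as an assumption.

```python
def symmetric(pairs):
    """Turn {(a, b): value} into a symmetric dict of dicts."""
    d = {}
    for (a, b), value in pairs.items():
        d.setdefault(a, {})[b] = value
        d.setdefault(b, {})[a] = value
    return d


def collapse_neighbors(d, i, j, parent):
    """Replace neighboring leaves i, j by their parent, using
    d(parent, m) = 1/2 (d(i, m) + d(j, m) - d(i, j))."""
    others = [m for m in d if m not in (i, j)]
    new = {m: {n: d[m][n] for n in others if n != m} for m in others}
    new[parent] = {}
    for m in others:
        dist = 0.5 * (d[i][m] + d[j][m] - d[i][j])
        new[parent][m] = dist
        new[m][parent] = dist
    return new


D = symmetric({
    ("v", "w"): 10, ("v", "x"): 17, ("v", "y"): 16, ("v", "z"): 16,
    ("w", "x"): 15, ("w", "y"): 14, ("w", "z"): 14,
    ("x", "y"): 9, ("x", "z"): 15, ("y", "z"): 14,
})
D1 = collapse_neighbors(D, "v", "w", "a")
print(D1["a"])
```

Applying the same function again to x, y (parent b) and then to b, z (parent c) shrinks the matrix down to the single distance d(a, c).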

Reconstructing Additive Distances Given T
D2 (x, y replaced by their parent b):
     b   z
a    6  10
b       10
D3 (b, z replaced by their parent c): d(a, c) = 3
Working back from D3:
d(a, c) = 3
d(b, c) = d(a, b) – d(a, c) = 3
d(c, z) = d(a, z) – d(a, c) = 7
d(b, x) = d(a, x) – d(a, b) = 5
d(b, y) = d(a, y) – d(a, b) = 4
d(a, w) = d(z, w) – d(a, z) = 4
d(a, v) = d(z, v) – d(a, z) = 6
Correct!!!

Neighbor-Joining
Guaranteed to produce the correct tree if distance is additive
May produce a good tree even when distance is not additive
Step 1: Finding neighboring leaves
Define $D_{ij} = (N - 2)\, d_{ij} - \sum_{k \neq i} d_{ik} - \sum_{k \neq j} d_{jk}$
Claim: The above “magic trick” ensures that Dij is minimal iff i, j are neighbors
(Figure: a four-leaf tree with edge lengths 0.1, 0.1, 0.1, 0.4, 0.4.)

Algorithm: Neighbor-joining
Initialization:
  Define T to be the set of leaf nodes, one per sequence
  Let L = T
Iteration:
  Pick i, j s.t. $D_{ij}$ is minimal
  Define a new node k, and set $d_{km} = \frac{1}{2}(d_{im} + d_{jm} - d_{ij})$ for all m ∈ L
  Add k to T, with edges of lengths $d_{ik} = \frac{1}{2}(d_{ij} + r_i - r_j)$, $d_{jk} = d_{ij} - d_{ik}$, where $r_i = (N - 2)^{-1} \sum_{k \neq i} d_{ik}$
  Remove i, j from L; Add k to L
Termination:
  When L consists of two nodes i, j, add the edge between them, of length $d_{ij}$
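The iteration above can be sketched compactly. This is an illustration under assumptions, not a reference implementation: N is taken to be the current number of nodes in L, new internal nodes are named n0, n1, ... (hypothetical labels), and the five-leaf matrix is an assumed additive example.

```python
from itertools import combinations


def neighbor_joining(dist):
    """dist: symmetric dict of dicts. Returns a list of (node, node, length)."""
    d = {a: dict(row) for a, row in dist.items()}
    edges = []
    next_id = 0
    while len(d) > 2:
        n = len(d)
        total = {a: sum(d[a].values()) for a in d}
        # D_ij = (N - 2) d_ij - sum_k d_ik - sum_k d_jk; pick the minimal pair
        i, j = min(
            combinations(d, 2),
            key=lambda p: (n - 2) * d[p[0]][p[1]] - total[p[0]] - total[p[1]],
        )
        k = f"n{next_id}"
        next_id += 1
        r_i, r_j = total[i] / (n - 2), total[j] / (n - 2)
        d_ik = 0.5 * (d[i][j] + r_i - r_j)
        edges += [(i, k, d_ik), (j, k, d[i][j] - d_ik)]
        # distances from the new node k to every remaining node m
        d[k] = {}
        for m in d:
            if m not in (i, j, k):
                d[k][m] = d[m][k] = 0.5 * (d[i][m] + d[j][m] - d[i][j])
        # remove i and j from L
        for m in (i, j):
            del d[m]
        for row in d.values():
            row.pop(i, None)
            row.pop(j, None)
    a, b = d
    edges.append((a, b, d[a][b]))
    return edges


def symmetric(pairs):
    out = {}
    for (a, b), value in pairs.items():
        out.setdefault(a, {})[b] = value
        out.setdefault(b, {})[a] = value
    return out


D = symmetric({
    ("v", "w"): 10, ("v", "x"): 17, ("v", "y"): 16, ("v", "z"): 16,
    ("w", "x"): 15, ("w", "y"): 14, ("w", "z"): 14,
    ("x", "y"): 9, ("x", "z"): 15, ("y", "z"): 14,
})
for edge in neighbor_joining(D):
    print(edge)
```

Because this matrix is additive, the recovered leaf edges (v: 6, w: 4, x: 5, y: 4, z: 7) do not depend on how ties between equally good pairs are broken.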

Parsimony – direct method not using distances
One of the most popular methods:
GIVEN multiple alignment
FIND tree & history of substitutions explaining alignment
Idea: Find the tree that explains the observed sequences with a minimal number of substitutions
Two computational subproblems:
  Find the parsimony cost of a given tree (easy)
  Search through all tree topologies (hard)

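Example: Parsimony cost of one column
(Figure: four leaves labeled A, B, A, A with sets {A}, {B}, {A}, {A}. The parent of the first two leaves gets {A, B} with cost C += 1, the parent of the last two gets {A}, and the root gets {A}. Final cost C = 1.)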

Parsimony Scoring
Given a tree, and an alignment column u
Label internal nodes to minimize the number of required substitutions
Initialization:
  Set cost C = 0; node k = 2N – 1 (the root)
Iteration:
  If k is a leaf, set Rk = { xk[u] } // Rk is simply the character of kth species
  If k is not a leaf, let i, j be the daughter nodes;
    Set Rk = Ri ∩ Rj if intersection is nonempty
    Set Rk = Ri ∪ Rj, and C += 1, if intersection is empty
Termination:
  Minimal cost of tree for column u, = C
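A short recursive sketch of this scoring rule for a single column. It assumes the tree is given as nested 2-tuples with leaf names as strings (a hypothetical representation), and it uses post-order recursion in place of the explicit node counter k.

```python
def parsimony_cost(tree, column):
    """Return (R, C): the candidate set at the root of `tree` and the
    minimal number of substitutions for this alignment column.

    tree: a leaf name (str) or a pair (left_subtree, right_subtree)
    column: dict mapping leaf name -> character at this column
    """
    if isinstance(tree, str):
        return {column[tree]}, 0
    left, right = tree
    r_i, c_i = parsimony_cost(left, column)
    r_j, c_j = parsimony_cost(right, column)
    common = r_i & r_j
    if common:
        return common, c_i + c_j      # intersection nonempty: no extra cost
    return r_i | r_j, c_i + c_j + 1   # intersection empty: one substitution


tree = (("s1", "s2"), ("s3", "s4"))
column = {"s1": "A", "s2": "B", "s3": "A", "s4": "A"}
print(parsimony_cost(tree, column))
```

For leaves A, B, A, A the left subtree yields {A, B} at cost 1, the right subtree yields {A}, and the root keeps {A} with total cost 1.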

Example
(Figure: a larger tree with leaves labeled A, A, A, A, B, B, A, B and internal nodes annotated with sets such as {A}, {B} and {A, B}.)

Traceback to find ancestral nucleotides
Choose an arbitrary nucleotide from R(2N – 1) for the root
Having chosen nucleotide r for parent k,
  If r ∈ Ri choose r for daughter i
  Else, choose arbitrary nucleotide from Ri
Easy to see that this traceback produces some assignment of cost C

Traceback Example
(Figure: three labelings of the same tree, with x marking substitutions. One is admissible with Traceback; the other two are still optimal, but inadmissible with Traceback.)

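Probabilistic Methods
A more refined measure of evolution along a tree than parsimony
(Figure: root xroot with daughters x1 and x2 on branches of length t1 and t2.)
$$P(x_1, x_2, x_{root} \mid t_1, t_2) = P(x_{root})\, P(x_1 \mid t_1, x_{root})\, P(x_2 \mid t_2, x_{root})$$
If we use Jukes-Cantor, for example, and x1 = xroot = A, x2 = C, t1 = t2 = 1,
$$= p_A \cdot \tfrac{1}{4}\left(1 + 3e^{-4\alpha}\right) \cdot \tfrac{1}{4}\left(1 - e^{-4\alpha}\right) = \left(\tfrac{1}{4}\right)^3 \left(1 + 3e^{-4\alpha}\right)\left(1 - e^{-4\alpha}\right)$$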
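The Jukes-Cantor example can be checked numerically. In this sketch the rate alpha = 0.1 is an arbitrary, hypothetical value and the function name is made up; the substitution probabilities are the standard Jukes-Cantor ones, ¼(1 + 3e^(-4αt)) for no change and ¼(1 - e^(-4αt)) for each specific change.

```python
import math


def jc_prob(a, b, t, alpha):
    """Jukes-Cantor probability of seeing b after time t, starting from a."""
    decay = math.exp(-4 * alpha * t)
    return 0.25 * (1 + 3 * decay) if a == b else 0.25 * (1 - decay)


alpha = 0.1      # hypothetical substitution rate
p_root = 0.25    # uniform prior p_A on the root nucleotide

# x_root = A, x1 = A, x2 = C, t1 = t2 = 1
joint = p_root * jc_prob("A", "A", 1, alpha) * jc_prob("A", "C", 1, alpha)
closed_form = 0.25 ** 3 * (1 + 3 * math.exp(-4 * alpha)) * (1 - math.exp(-4 * alpha))
print(joint, closed_form)
```

The two printed numbers agree, and each row of the substitution matrix sums to 1 because one "same" term plus three "different" terms add up to ¼ · 4.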

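Probabilistic Methods
(Figure: a tree with root xroot, internal nodes xu, and leaves x1, x2, …, xN.)
If we know all internal labels xu,
$$P(x_1, x_2, \ldots, x_N, x_{N+1}, \ldots, x_{2N-1} \mid T, t) = P(x_{root}) \prod_{j \neq root} P(x_j \mid x_{parent(j)}, t_{j,\,parent(j)})$$
Usually we don’t know the internal labels, therefore
$$P(x_1, x_2, \ldots, x_N \mid T, t) = \sum_{x_{N+1}} \sum_{x_{N+2}} \cdots \sum_{x_{2N-1}} P(x_1, x_2, \ldots, x_{2N-1} \mid T, t)$$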

Felsenstein’s Likelihood Algorithm
To calculate $P(x_1, x_2, \ldots, x_N \mid T, t)$
Let $P(L_k \mid a)$ denote the prob. of all the leaves below node k, given that the residue at k is a
Initialization:
  Set k = 2N – 1
Iteration: Compute $P(L_k \mid a)$ for all a ∈ Σ
  If k is a leaf node:
    Set $P(L_k \mid a) = 1(a = x_k)$
  If k is not a leaf node:
    1. Compute $P(L_i \mid b)$, $P(L_j \mid b)$ for all b, for daughter nodes i, j
    2. Set $P(L_k \mid a) = \sum_{b,c} P(b \mid a, t_i)\, P(L_i \mid b)\, P(c \mid a, t_j)\, P(L_j \mid c)$
Termination:
  Likelihood at this column $= P(x_1, x_2, \ldots, x_N \mid T, t) = \sum_a P(L_{2N-1} \mid a)\, P(a)$
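A recursive sketch of this algorithm for one alignment column. Several choices here are assumptions: Jukes-Cantor substitution probabilities, a uniform prior P(a) = ¼, a tree encoded as nested pairs ((left, t_left), (right, t_right)), and hypothetical function names. The double sum over b, c is computed as a product of two single sums, which is the same quantity.

```python
import math

ALPHABET = "ACGT"


def jc_prob(a, b, t, alpha):
    """Jukes-Cantor P(b | a, t)."""
    decay = math.exp(-4 * alpha * t)
    return 0.25 * (1 + 3 * decay) if a == b else 0.25 * (1 - decay)


def leaves_given(tree, column, alpha):
    """Return {a: P(L_k | a)} for the node k at the top of `tree`.

    tree: a leaf name (str) or ((left, t_left), (right, t_right))
    column: dict mapping leaf name -> residue at this column
    """
    if isinstance(tree, str):
        return {a: 1.0 if a == column[tree] else 0.0 for a in ALPHABET}
    (left, t_i), (right, t_j) = tree
    l_i = leaves_given(left, column, alpha)
    l_j = leaves_given(right, column, alpha)
    return {
        a: sum(jc_prob(a, b, t_i, alpha) * l_i[b] for b in ALPHABET)
        * sum(jc_prob(a, c, t_j, alpha) * l_j[c] for c in ALPHABET)
        for a in ALPHABET
    }


def column_likelihood(tree, column, alpha, prior=0.25):
    """Sum over root residues a of P(L_root | a) P(a)."""
    at_root = leaves_given(tree, column, alpha)
    return sum(prior * at_root[a] for a in ALPHABET)


alpha = 0.1  # hypothetical rate
tree = (("x1", 1.0), ("x2", 1.0))
print(column_likelihood(tree, {"x1": "A", "x2": "C"}, alpha))
```

As a sanity check, for two leaves A and C at distance 1 from the root the result equals ¼ · P(C | A, t = 2), because the Jukes-Cantor model is reversible.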

Probabilistic Methods
Given M (ungapped) alignment columns of N sequences,
Define likelihood of a tree:
$$L(T, t) = P(\text{Data} \mid T, t) = \prod_{m=1}^{M} P(x_{1m}, \ldots, x_{Nm} \mid T, t)$$
Maximum Likelihood Reconstruction:
Given data X = (xij), find a topology T and length vector t that maximize likelihood L(T, t)

Current popular methods
HUNDREDS of programs available!
http://evolution.genetics.washington.edu/phylip/software.html#methods
Some recommended programs:
Discrete (parsimony-based)
  Rec-1-DCM3 http://www.cs.utexas.edu/users/tandy/mp.html
  Tandy Warnow and colleagues
Probabilistic
  SEMPHY http://www.cs.huji.ac.il/labs/compbio/semphy/
  Nir Friedman and colleagues