DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA.

Slides:



Advertisements
Similar presentations
Sequencing a genome. Definition Determining the identity and order of nucleotides in the genetic material – usually DNA, sometimes RNA, of an organism.
Advertisements

9 Genomics and Beyond Brief Chapter Outline
Connection Between Alignment and HMMs. A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII.
CS273a Lecture 4, Autumn 08, Batzoglou Some Terminology insert a fragment that was incorporated in a circular genome, and can be copied (cloned) vector.
DNA Sequencing Lecture 9, Tuesday April 29, 2003.
Physical Mapping I CIS 667 February 26, Physical Mapping A physical map of a piece of DNA tells us the location of certain markers  A marker is.
Class 02: Whole genome sequencing. The seminal papers ``Is Whole Genome Sequencing Feasible?'' ``Whole-Genome DNA.
CS262 Lecture 11, Win07, Batzoglou Some Terminology insert a fragment that was incorporated in a circular genome, and can be copied (cloned) vector the.
Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.
DNA Sequencing. The Walking Method 1.Build a very redundant library of BACs with sequenced clone- ends (cheap to build) 2.Sequence some “seed” clones.
DNA Sequencing Some Terminology insert a fragment that was incorporated in a circular genome, and can be copied (cloned) vector the circular genome (host)
DNA Sequencing.
DNA Sequencing. CS273a Lecture 3, Spring 07, Batzoglou DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT.
Assembly.
DNA Sequencing. Next few topics DNA Sequencing  Sequencing strategies Hierarchical Online (Walking) Whole Genome Shotgun  Sequencing Assembly Gene Recognition.
DNA Sequencing and Assembly
DNA Sequencing.
CS273a Lecture 4, Autumn 08, Batzoglou Fragment Assembly (in whole-genome shotgun sequencing) CS273a Lecture 5.
Sequencing and Assembly Cont’d. CS273a Lecture 5, Win07, Batzoglou Steps to Assemble a Genome 1. Find overlapping reads 4. Derive consensus sequence..ACGATTACAATAGGTT..
Sequencing and Assembly Cont’d. CS273a Lecture 5, Aut08, Batzoglou Steps to Assemble a Genome 1. Find overlapping reads 4. Derive consensus sequence..ACGATTACAATAGGTT..
DNA Sequencing. CS273a Lecture 3, Spring 07, Batzoglou Steps to Assemble a Genome 1. Find overlapping reads 4. Derive consensus sequence..ACGATTACAATAGGTT..
DNA Sequencing. CS262 Lecture 9, Win07, Batzoglou DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT.
CS273a Lecture 4, Autumn 08, Batzoglou Hierarchical Sequencing.
DNA Sequencing.
DNA Sequencing. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA TATATATATACGTCGTCGT.
Conditional Random Fields 1 2 K … 1 2 K … 1 2 K … … … … 1 2 K … x1x1 x2x2 x3x3 xKxK 2 1 K 2.
DNA Sequencing. CS262 Lecture 9, Win06, Batzoglou DNA Sequencing – gel electrophoresis 1.Start at primer(restriction site) 2.Grow DNA chain 3.Include.
DNA Sequencing. CS262 Lecture 9, Win06, Batzoglou DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT.
CS273a Lecture 1, Autumn 10, Batzoglou DNA Sequencing.
CS273a Lecture 2, Autumn 10, Batzoglou DNA Sequencing (cont.)
DNA Sequencing. CS273a Lecture 3, Autumn 08, Batzoglou DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT.
CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics.
CS273a Lecture 4, Autumn 08, Batzoglou DNA Sequencing.
CS262 Lecture 9, Win07, Batzoglou Conditional Random Fields A brief description.
DNA Sequencing. Next few topics DNA Sequencing  Sequencing strategies Hierarchical Online (Walking) Whole Genome Shotgun  Sequencing Assembly Gene Recognition.
Genome sequencing and assembling
Genome sequencing. Vocabulary Bac: Bacterial Artificial Chromosome: cloning vector for yeast Pac, cosmid, fosmid, plasmid: cloning vectors for E. coli.
Genome Analysis Determine locus & sequence of all the organism’s genes More than 100 genomes have been analysed including humans in the Human Genome Project.
CS 6030 – Bioinformatics Summer II 2012 Jason Eric Johnson
Presentation on genome sequencing. Genome: the complete set of gene of an organism Genome annotation: the process by which the genes, control sequences.
Mouse Genome Sequencing
CS 394C March 19, 2012 Tandy Warnow.
Graphs and DNA sequencing CS 466 Saurabh Sinha. Three problems in graph theory.
CS273a Lecture 4, Autumn 08, Batzoglou CS273a 2011 DNA Sequencing.
Genome sequencing Haixu Tang School of Informatics.
Biological Motivation for Fragment Assembly Rhys Price Jones Anne R. Haake.
A Sequenciação em Análises Clínicas Polymerase Chain Reaction.
SIZE SELECT SHEAR Shotgun DNA Sequencing (Technology) DNA target sample LIGATE & CLONE Vector End Reads (Mates) SEQUENCE Primer.
Linkage and Mapping. Figure 4-8 For linked genes, recombinant frequencies are less than 50 percent.
CS273a Lecture 4, Autumn 08, Batzoglou CS273a 2013 DNA Structure.
Human Genome.
DNA Sequencing.
Whole Genome Sequencing (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 13, 2005 ChengXiang Zhai Department of Computer Science University of.
Plan A Topics? 1.Making a probiotic strain of E.coli that destroys oxalate to help treat kidney stones in collaboration with Dr. Lucent and Dr. VanWert.
Chapter 5 Sequence Assembly: Assembling the Human Genome.
VL Algorithmische BioInformatik (19710) WS2015/2016 Woche 7 - Mittwoch Tim Conrad AG Medical Bioinformatics Institut für Mathematik & Informatik, Freie.
Genome Research 12:1 (2002), Assembly algorithm outline ● Input and trimming ● Overlap detection ● Error correction ● Evaluation of alignments.
Genome Analysis. This involves finding out the: order of the bases in the DNA location of genes parts of the DNA that controls the activity of the genes.
Title: Studying whole genomes Homework: learning package 14 for Thursday 21 June 2016.
CSCI2950-C Lecture 2 DNA Sequencing and Fragment Assembly
DNA Sequencing.
DNA Sequencing Project
Fragment Assembly (in whole-genome shotgun sequencing)
Pre-genomic era: finding your own clones
Stuff to Do.
A Sequenciação em Análises Clínicas
CSE 5290: Algorithms for Bioinformatics Fall 2009
CSCI 1810 Computational Molecular Biology 2018
Introduction to Sequencing
Sequence the 3 billion base pairs of human
Presentation transcript:

DNA Sequencing and Assembly

DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT…

Which representative of the species? Which human? Answer one: Answer two: it doesn’t matter Polymorphism rate: number of letter changes between two different members of a species Humans: ~1/1,000 – 1/10,000 Other organisms have much higher polymorphism rates

DNA sequencing – vectors + = DNA Shake DNA fragments Vector Circular genome (bacterium, plasmid) Known location (restriction site)

Different types of vectors VECTORSize of insert Plasmid 2,000-10,000 Can control the size Cosmid40,000 BAC (Bacterial Artificial Chromosome) 70, ,000 YAC (Yeast Artificial Chromosome) > 300,000 Not used much recently

DNA sequencing – gel electrophoresis Start at primer (restriction site) Grow DNA chain Include dideoxynucleoside (modified a, c, g, t) Stops reaction at all possible points Separate products with length, using gel electrophoresis

Electrophoresis diagrams

Output of PHRAP: a read A read: nucleotides A C G A A T C A G …. A Quality scores: -10  log 10 Prob(Error) Reads can be obtained from leftmost, rightmost ends of the insert Double-barreled sequencing: Both leftmost & rightmost ends are sequenced

Method to sequence segments longer than 500 cut many times at random (Shotgun) genomic segment Get one or two reads from each segment ~500 bp

Reconstructing the Sequence (Fragment Assembly) Cover region with ~7-fold redundancy (7X) Overlap reads and extend to reconstruct the original genomic region reads

Definition of Coverage Length of genomic segment:L Number of reads: n Length of each read:l Definition: Coverage C = nl/L How much coverage is enough? (Lander-Waterman model): Assuming uniform distribution of reads, C=10 results in 1 gapped region /1,000,000 nucleotides C

Challenges with Fragment Assembly Sequencing errors ~1-2% of bases are wrong Repeats Computation: ~ O( N 2 ) where N = # reads false overlap due to repeat

Repeats Bacterial genomes:5% Mammals:50% Repeat types: Low-Complexity DNA(e.g. ATATATATACATA…) Microsatellite repeats: (a 1 …a k ) N where k ~ 3-6 (e.g. CAGCAGTAGCAGCACCAG) Common Repeat Families SINE (Short Interspersed Nuclear Elements) (e.g. ALU: ~300-long, 10 6 copies) LINE (Long Interspersed Nuclear Elements) ~500-5,000-long, 200,000 copies MIR LTR/Retroviral Other -Genes that are duplicated & then diverge (paralogs) -Recent duplications, ~100,000-long, very similar copies

Strategies for sequencing a whole genome 1.Hierarchical – Clone-by-clone i.Break genome into many long pieces ii.Map each long piece onto the genome iii.Sequence each piece with shotgun Example: Yeast, Worm, Human, Rat 2.Online version of (1) – Walking i.Break genome into many long pieces ii.Start sequencing each piece with shotgun iii.Construct map as you go Example: Rice genome 3.Whole genome shotgun One large shotgun pass on the whole genome Example: Drosophila, Human (Celera), Neurospora, Mouse, Rat, Fugu

Hierarchical Sequencing

Hierarchical Sequencing Strategy 1.Obtain a large collection of BAC clones 2.Map them onto the genome (Physical Mapping) 3.Select a minimum tiling path 4.Sequence each clone in the path with shotgun 5.Assemble 6.Put everything together a BAC clone map genome

Methods of physical mapping Goal: Make a map of the locations of each clone relative to one another Use the map to select a minimal set of clones to sequence Methods: Hybridization Digestion

1. Hybridization Short words, the probes, attach to complementary words 1.Construct many probes 2.Treat each BAC with all probes 3.Record which ones attach to it 4.Same words attaching to BACS X, Y  overlap p1p1 pnpn

Hybridization – Computational Challenge Matrix: m probes  n clones (i, j):1, if p i hybridizes to C j 0, otherwise Definition: Consecutive ones matrix A matrix 1s are consecutive Computational problem: Reorder the probes so that matrix is in consecutive-ones form Can be solved in O(m 3 ) time (m >> n) Unfortunately, data is not perfect p 1 p 2 …………………….p m C 1 C 2 ……………….C n 1 0 1………………… ………………… …………………..1 p i1 p i2 …………………….p im C j1 C j2 ……………….C jn …………… …………… …………… ……… ………

2.Digestion Restriction enzymes cut DNA where specific words appear 1.Cut each clone separately with an enzyme 2.Run fragments on a gel and measure length 3.Clones C a, C b have fragments of length { l i, l j, l k }  overlap Double digestion: Cut with enzyme A, enzyme B, then enzymes A + B

Whole-Genome Shotgun Sequencing

Whole Genome Shotgun Sequencing cut many times at random genome forward-reverse linked reads plasmids (2 – 10 Kbp) cosmids (40 Kbp) known dist ~500 bp

ARACHNE: Steps to Assemble a Genome 1. Find overlapping reads 4. Derive consensus sequence..ACGATTACAATAGGTT.. 2. Merge good pairs of reads into longer contigs 3. Link contigs to form supercontigs + many heuristics

1. Find Overlapping Reads Sort all k-mers in reads (k ~ 24) TAGATTACACAGATTAC ||||||||||||||||| Find pairs of reads sharing a k-mer Extend to full alignment – throw away if not >95% similar T GA TAGA | || TACA TAGT ||

1. Find Overlapping Reads One caveat: repeats A k-mer that appears N times, initiates N 2 comparisons ALU: 1,000,000 times Solution: Discard all k-mers that appear more than c  Coverage, (c ~ 10)

1. Find Overlapping Reads Create local multiple alignments from the overlapping reads TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA

1. Find Overlapping Reads (cont’d) Correct errors using multiple alignment TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA C: 20 C: 35 T: 30 C: 35 C: 40 C: 20 C: 35 C: 0 C: 35 C: 40 Score alignments Accept alignments with good scores A: 15 A: 25 A: 40 A: 25 - A: 15 A: 25 A: 40 A: 25 A: 0

Basic principle of assembly Repeats confuse us Ability to merge two reads  ability to detect repeats We can dismiss as repeat any overlap of < t% similarity Role of error correction: Discards ~90% of single-letter sequencing errors  Threshold t% increases

2. Merge Reads into Contigs (cont’d) Merge reads up to potential repeat boundaries (Myers, 1995) repeat region

2. Merge Reads into Contigs (cont’d) Ignore non-maximal reads Merge only maximal reads into contigs repeat region

2. Merge Reads into Contigs (cont’d) Ignore “hanging” reads, when detecting repeat boundaries sequencing error repeat boundary??? b a

2. Merge Reads into Contigs (cont’d) ????? Unambiguous Insert non-maximal reads whenever unambiguous

3. Link Contigs into Supercontigs Too dense: Overcollapsed? (Myers et al. 2000) Inconsistent links: Overcollapsed? Normal density

Find all links between unique contigs 3. Link Contigs into Supercontigs (cont’d) Connect contigs incrementally, if  2 links

Fill gaps in supercontigs with paths of overcollapsed contigs 3. Link Contigs into Supercontigs (cont’d)

4. Derive Consensus Sequence Derive multiple alignment from pairwise read alignments TAGATTACACAGATTACTGA TTGATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAAACTA TAG TTACACAGATTATTGACTTCATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGGGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAA CTA Derive each consensus base by weighted voting

Mouse Genome Improved version of ARACHNE assembled the mouse genome Several heuristics of iteratively: Breaking supercontigs that are suspicious Rejoining supercontigs Size of problem: 32,000,000 reads Time: 15 days, 1 processor Memory: 28 Gb N50 Contig size: 16.3 Kb  24.8 Kb N50 Supercontig size:.265 Mb  16.9 Mb

Mouse Assembly

Why sequence more genomes?