Steps in a genome sequencing project

Funding and sequencing strategy
- source of funding identified / community drives development of sequencing strategy
- random shotgun (chromosome & whole genome): sheared gDNA libraries, physical maps not necessary, fast, whole genome coverage produced quickly, assembly may be problematic
- clone-by-clone (map-as-you-go): BAC, YAC, cosmid libraries & physical maps, slower, data produced less quickly, from isolated regions
Procurement of DNA: library construction, test sequencing, analysis of data
Large-scale sequencing of libraries
Assembly and data release
- for shotgun projects: at 3X, first assembly and release of genome data; at 5-6X, ~97% of genes sequenced; at 8-10X coverage, final assembly
- for clone-by-clone: sequence of clones released as completed
Closure
- gap closure, repeat resolution, identification of mis-assemblies: time-consuming, expensive
- comparison to physical/genetic/optical maps
Gene finding and annotation
- train gene finding algorithms and predict gene models
- genome annotation: auto-annotation vs manual annotation
- genome analysis, comparative genomics, publication, final data release to GenBank

Sequencing strategies for long DNA
We can’t directly sequence long DNA (yet), but we can assemble the master sequence from smaller pieces.

Shotgun Library Construction & Sequencing
Concept:
1) Shred long DNA into lots of random short fragments
2) Sequence both ends of the fragments
3) Reassemble the original DNA from overlapping sequences of the fragments
SOUNDS EASY!
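The three steps of the concept can be sketched in a few lines of Python. This is a toy illustration, not a real pipeline: the function names, the fixed insert size and the error-free reads are all assumptions made for the example.

```python
import random

def shotgun_end_reads(genome, n_fragments, insert_size, read_len, seed=0):
    """Steps 1 and 2: cut random fragments and read both ends of each.

    Hypothetical helper: every fragment has exactly `insert_size` bases
    and reads contain no sequencing errors.
    """
    rng = random.Random(seed)
    complement = str.maketrans("ACGT", "TGCA")
    pairs = []
    for _ in range(n_fragments):
        start = rng.randrange(0, len(genome) - insert_size + 1)
        fragment = genome[start:start + insert_size]
        left = fragment[:read_len]
        # The second read comes off the opposite strand,
        # so it is the reverse complement of the fragment's far end.
        right = fragment[-read_len:].translate(complement)[::-1]
        pairs.append((left, right))
    return pairs

def overlap(a, b, min_len=3):
    """Step 3 building block: longest suffix of `a` that is a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0
```

For example, `overlap("ACGTAC", "TACGG")` returns 3, because the last three bases of the first read match the first three of the second. A real assembler also has to cope with sequencing errors and repeats, which is where "sounds easy" stops being true.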

Methods: sonication, syringe, nebulization. NOT RESTRICTION ENZYMES (they cut at fixed recognition sites, so the fragments would not be random)

Size-selected shotgun fragment libraries
- Small insert library provides most of the sequence coverage (contigs)
- Large insert libraries help order the contigs (and scaffolds)

[Figure: two mate pairs, one with ~1 kb between the reads and one with ~9 kb between the reads; each pair consists of a 5’ end read and a 3’ end read]

Assembly of contigs from mate pairs
- input sequence must be high-quality (well-trimmed), to reduce false overlaps
- reads must be mostly mate pairs (<25% single reads)
- library insert size variance must be kept low (<10%) for accurate prediction of the distance between mate-pair sequences
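The two numeric thresholds above can be turned into a simple library check. This is a minimal sketch under one loud assumption: the slide's "variance <10%" is read here as a coefficient of variation (standard deviation divided by mean insert size) below 0.10. The function name and parameters are hypothetical.

```python
from statistics import mean, pstdev

def library_passes(insert_sizes, n_mate_reads, n_single_reads,
                   max_cv=0.10, max_single_fraction=0.25):
    """Check a shotgun library against the two thresholds.

    ASSUMPTION: "insert size variance < 10%" is interpreted as
    coefficient of variation (stdev / mean) < 0.10.
    """
    cv = pstdev(insert_sizes) / mean(insert_sizes)
    single_fraction = n_single_reads / (n_mate_reads + n_single_reads)
    return cv < max_cv and single_fraction < max_single_fraction
```

A tightly size-selected library such as `library_passes([1000, 1050, 950, 1000], 900, 100)` passes both checks, while a library with widely scattered insert sizes, or one with 40% single reads, does not.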

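Scaffolds, or ‘Why we sequence mate pairs from longer fragments’
[Figure: scaffold diagram with a region labelled low-complexity/repetitive]
Knowing the sizes of inserts can tell us roughly what we don’t know (sometimes): when the two reads of a mate pair land in different contigs, the known insert size gives an estimate of the gap between them.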
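How insert size translates into a gap estimate can be sketched as follows. The coordinate convention and the function name are assumptions for illustration: each link is a mate pair whose fragment starts at `start_in_a` inside contig A and ends at `end_in_b` inside contig B, with A lying before B in the same orientation.

```python
from statistics import mean

def estimate_gap(len_a, links, insert_size):
    """Estimate the gap between contig A and contig B from spanning mate pairs.

    links: list of (start_in_a, end_in_b) tuples (hypothetical convention).
    For each pair: insert_size = (len_a - start_in_a) + gap + end_in_b.
    """
    estimates = [insert_size - (len_a - start_a) - end_b
                 for start_a, end_b in links]
    return mean(estimates)
```

With a 5,000-base contig A and 9 kb inserts, a pair starting at position 4000 in A and ending at position 3000 in B implies a gap of 9000 - 1000 - 3000 = 5000 bases. Averaging over several pairs smooths out the spread in insert sizes, which is why low insert size variance matters.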

Scaffolds into chromosomes

COVERAGE
What does “8X coverage” mean? Two ways of thinking about it:
- The average number of times any given base in the genome was sequenced (in this case, each base was read 8 times on average; of course, a particular base may have been read more or fewer than 8 times).
- The amount of sequence that was obtained, relative to the length of the whole genome (in this case, the aggregate length of all reads was 8 times the genome length).
Lander & Waterman (1988) determined that for an ideal genome project (no ‘difficult’ regions) 8X-10X coverage is sufficient to confidently complete the genome.
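Both readings of coverage come down to the same arithmetic, sketched below. The second function uses the idealized Poisson approximation commonly associated with the Lander & Waterman model, in which reads land uniformly at random; the function names are hypothetical and the model ignores repeats, cloning bias and the minimum overlap needed to join reads.

```python
import math

def coverage(n_reads, read_len, genome_len):
    """Aggregate read length divided by genome length (e.g. 8.0 means 8X)."""
    return n_reads * read_len / genome_len

def fraction_unsequenced(c):
    """Idealized chance that a given base is covered by zero reads at coverage c.

    ASSUMPTION: reads are placed uniformly at random (Poisson model),
    so P(no read covers a base) = e^(-c).
    """
    return math.exp(-c)
```

For example, 80,000 reads of 500 bases over a 5 Mb genome give `coverage(80000, 500, 5_000_000) == 8.0`. At 8X the idealized model leaves only about 0.03% of bases unread, which is why 8X-10X is considered enough for a genome with no difficult regions.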

NO EUKARYOTIC GENOME IS THAT WELL-BEHAVED
So even with 8X shotgun coverage there’s likely at least ~1% of the genome remaining to be finished, by more laborious and expensive means. (The human genome... are we there yet??)
- Some genomes are relatively well-behaved: nearly all sequence reads were assembled into contigs -> scaffolds -> chromosomes, with relatively few or no gaps remaining (e.g., Plasmodium falciparum).
- Some genomes are very badly behaved and far from finished: reads may remain unassigned to contigs, much less scaffolds, much less chromosomes. There are lots of gaps (Ns) and lots of repeats. E.g., the Trichomonas vaginalis genome: huge, highly repetitive, AT-rich; low-quality sequence was allowed in to increase coverage/gene calls in ‘difficult’ regions.

Finishing
- Closure of gaps between contigs/scaffolds
- Correction of misassemblies
- Resequencing of low-coverage/low-quality regions
This is usually the most time-consuming part of the project. Repeat/low-complexity regions can be hard to sequence, and it can be hard to know where to ‘put’ them in the final assembly.

Sequence hierarchy
- genome (all chromosomes)
- chromosome (one or more scaffolds... ultimately one contig!): ordered sets w/gaps
- scaffold (two or more contigs): ordered sets w/gaps, size estimated
- contig: overlapping, ordered sets of reads, no gaps
- reads (mate-pair & single)
Note: scaffolds and contigs are not biological entities.
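The contig and scaffold levels of this hierarchy can be modelled as a small data structure. This is a sketch with hypothetical class and field names; writing each estimated gap as a run of N characters is a common convention, assumed here rather than taken from the slides.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Contig:
    sequence: str  # gap-free consensus built from overlapping reads

@dataclass
class Scaffold:
    contigs: List[Contig]  # ordered; expected to hold two or more contigs
    gap_sizes: List[int]   # estimated gap after each contig except the last

    def to_sequence(self) -> str:
        """Join the contigs, writing each estimated gap as a run of Ns."""
        parts = [self.contigs[0].sequence]
        for gap, contig in zip(self.gap_sizes, self.contigs[1:]):
            parts.append("N" * gap)
            parts.append(contig.sequence)
        return "".join(parts)
```

For example, `Scaffold([Contig("ACGT"), Contig("GGCC")], [3]).to_sequence()` gives `"ACGTNNNGGCC"`. A chromosome can be represented the same way one level up, as an ordered list of scaffolds, until finishing collapses it into a single contig.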

Post-sequencing steps
- Automated gene calling (setting boundaries)
- Annotation (guessing function)
- Manual: refining gene models, correcting annotation
This should be an ONGOING process... wish it was.

OTHER STUFF (demonstrated on the websites)
- Adding columns
- Sorting (some are presorted)
- Gaps: more than one N (within a scaffold, or a gap between scaffolds), vs ambiguities (within a contig) (see P. falc)
- Chromosome as one giant contig... or one giant scaffold
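The distinction between gaps and ambiguities can be checked directly on a sequence string. The sketch below assumes one reading of the note above: a run of more than one N marks a gap, while a single isolated N marks an ambiguous base call within a contig. The function name is hypothetical.

```python
import re

def classify_n_runs(sequence):
    """Split runs of N into gaps (length > 1) and ambiguities (length 1).

    ASSUMPTION: the length-based rule is an interpretation of the slide,
    not a fixed standard. Returns two lists of (start, length) tuples.
    """
    gaps, ambiguities = [], []
    for match in re.finditer(r"N+", sequence.upper()):
        run = (match.start(), match.end() - match.start())
        if run[1] > 1:
            gaps.append(run)
        else:
            ambiguities.append(run)
    return gaps, ambiguities
```

For example, `classify_n_runs("ACGNTTNNNNGGA")` returns `([(6, 4)], [(3, 1)])`: one four-base gap starting at position 6 and one ambiguous base at position 3.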