Genome Sequencing and Assembly

1 Genome Sequencing and Assembly

2 History of Sequencing What was the first fully sequenced nucleic acid?
Yeast alanine tRNA, sequenced by Robert Holley, 1965. Image: Wikipedia

3 History of Sequencing Sequencing began with RNA, not DNA
rRNA Oligomer Cataloging Uchida et al. 1974; Woese et al. 1975

4 History of Sequencing Sequencing Milestones
1965 – First nucleic acid sequenced: yeast alanine tRNA
1976 – First complete genome sequenced (RNA virus: bacteriophage MS2)
1977 – First complete DNA genome (Phage Φ-X174)
1995 – First complete cellular genome sequenced (Haemophilus influenzae)
1996 – First eukaryotic genome sequenced (yeast)
2001 – Publication of the first sequenced human genome
2016 – Todos Santos Genomics and Computational Biology Workshop first offered!

5 History of Sequencing Technological Advances
1975 – Plus and minus DNA sequencing method (Sanger and Coulson)
1977 – Maxam-Gilbert sequencing and Sanger DNA dideoxy terminator sequencing methods
1980s-1990s – Refinements to Sanger sequencing: fluorescent labeling of ddNTPs, capillary electrophoresis, automated basecalling, polymerase chain reaction (PCR)
2005 – Introduction of 454 Sequencing and the NGS Revolution

6 Genome Assembly Image: Drew Sheneman

7 Genome Assembly Jigsaw Puzzle Genome Assembly Image: dreamstime.com

8 Genome Assembly Assembly Algorithms
Overlap – Layout – Consensus (OLC): e.g., Celera
de Bruijn Graph: e.g., ALLPATHS-LG, SPAdes, SOAPdenovo, Velvet

9 Genome Assembly Overlap – Layout – Consensus
Overlap: Alignment/comparison of ALL pairwise combinations of reads
ACGTAGCTAGCATCGATCGATCGACTGATCGATCGATCGATCATC
       TAGCATCGATCGATCGACTGATCGTTCGATCGATCATCAGCATG
Layout: Build contiguous sequences (contigs) by simplifying the network of observed overlaps
Consensus: Determine the sequence of each contig by resolving ambiguities that result from sequencing errors and/or nucleotide polymorphism.
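The overlap step can be illustrated with a minimal Python sketch. It is not how Celera or any production OLC assembler computes overlaps (those use indexed or banded alignment); it is a brute-force suffix-prefix comparison that tolerates a small number of mismatches. The function name and the `min_len` and `max_mismatches` parameters are assumptions made for this example. The two reads are the ones shown on the slide.

```python
def suffix_prefix_overlap(a, b, min_len=10, max_mismatches=1):
    """Length of the longest suffix of `a` that matches a prefix of `b`
    with at most `max_mismatches` mismatches; 0 if none reaches `min_len`."""
    # Try the longest candidate overlap first and shrink until one fits.
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        suffix = a[len(a) - length:]
        prefix = b[:length]
        mismatches = sum(1 for x, y in zip(suffix, prefix) if x != y)
        if mismatches <= max_mismatches:
            return length
    return 0


read1 = "ACGTAGCTAGCATCGATCGATCGACTGATCGATCGATCGATCATC"
read2 = "TAGCATCGATCGATCGACTGATCGTTCGATCGATCATCAGCATG"

# The reads overlap by 38 bases with a single mismatch (A vs. T),
# the kind of ambiguity the consensus step has to resolve.
print(suffix_prefix_overlap(read1, read2))  # 38
```

Running this for every ordered pair of reads is what makes the overlap step expensive: the number of comparisons grows with the square of the number of reads.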

10 Genome Assembly de Bruijn Graph (i.e., Network)
Does NOT require performing all possible pairwise comparisons between reads.
Extract all k-mers (i.e., subsequences of length k) from each read to be the nodes in the graph.
Sequence read: ACAGGATAT
k-mers (k = 5): ACAGG, CAGGA, AGGAT, GGATA, GATAT
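A minimal Python sketch of k-mer extraction, using the read and k from the slide. The function name `extract_kmers` is chosen for this example only.

```python
def extract_kmers(read, k):
    """Return every length-k substring of `read`, in left-to-right order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


print(extract_kmers("ACAGGATAT", 5))
# ['ACAGG', 'CAGGA', 'AGGAT', 'GGATA', 'GATAT']
```

A read of length n yields n - k + 1 k-mers, so this step is linear in the total amount of sequence rather than quadratic in the number of reads.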

11 Genome Assembly de Bruijn Graph (i.e., Network)
Edges (i.e., connections) in the graph are defined between k-mers that overlap by k - 1 bases within reads.
Genome: ACAGGATATGGATACCACG
Read 1: ACAGGATATGG
Read 2: GGATATGGATA
Read 3: TGGATACCACG
k-mers (k = 5): ACAGG, CAGGA, AGGAT, GGATA, GATAT, ATATG, TATGG, ATGGA, TGGAT, GGATA, GATAC, ATACC, TACCA, ACCAC, CCACG
Graph (first steps): ACAGG → CAGGA → AGGAT → GGATA

13 Genome Assembly de Bruijn Graph (i.e., Network)
Edges (i.e., connections) in the graph are defined between k-mers that overlap by k - 1 bases within reads.
Genome: ACAGGATATGGATACCACG
Read 1: ACAGGATATGG
Read 2: GGATATGGATA
Read 3: TGGATACCACG
k-mers (k = 5): ACAGG, CAGGA, AGGAT, GGATA, GATAT, ATATG, TATGG, ATGGA, TGGAT, GGATA, GATAC, ATACC, TACCA, ACCAC, CCACG
Graph (complete): ACAGG → CAGGA → AGGAT → GGATA → GATAT → ATATG → TATGG → ATGGA → TGGAT → GGATA → GATAC → ATACC → TACCA → ACCAC → CCACG
The k-mer GGATA occurs twice in the genome, so the path loops back through the same node.
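The graph construction can be sketched in a few lines of Python. This follows the simplified formulation used on these slides, with k-mers as nodes and an edge wherever two k-mers are adjacent in a read (and therefore overlap by k - 1 bases). Many assemblers instead use (k-1)-mers as nodes and k-mers as edges. The function name `build_kmer_graph` is an assumption for this example, and the sketch ignores reverse complements, sequencing errors, and k-mer counts.

```python
from collections import defaultdict


def build_kmer_graph(reads, k):
    """Map each k-mer to the set of k-mers that directly follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            graph[kmer]  # touch the key so k-mers with no successor are still nodes
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)
    return dict(graph)


reads = ["ACAGGATATGG", "GGATATGGATA", "TGGATACCACG"]
graph = build_kmer_graph(reads, 5)

print(len(graph))              # 14 distinct k-mers (GGATA is shared)
print(sorted(graph["GGATA"]))  # ['GATAC', 'GATAT']
```

The node GGATA has two outgoing edges because that k-mer occurs twice in the genome. Resolving which branch to follow, and in what order, is the hard part of assembly; repeats longer than k cannot be resolved from the graph alone.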

14 Genome Assembly Assessing Assembly Contiguity
Contig – A contiguous sequence of nucleotides produced by genome assembly
ACGTCATCGATGCATGCATGACGATCGTAGCATG
Scaffold – An assembled portion of a genome that can contain multiple contigs connected by structural information (e.g., paired-end reads, optical or genetic mapping, etc.) but separated by gaps.
ACGTCATCGATGCATGCATGACGATCGTAGCATGNNNNNNNNNNACGATCGTAGCATCGATAACGT
Contig (or Scaffold) N50 – The largest length L such that the contigs (or scaffolds) of length L or longer together account for at least 50% of the total assembly length.
Contig (or Scaffold) L50 – The smallest number of contigs (or scaffolds) whose combined length accounts for at least 50% of the total assembly length.
Yes, N50 is a LENGTH and L50 is a NUMBER. It’s meant to confuse you!
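Both statistics fall out of one pass over the contig lengths sorted from longest to shortest. The sketch below is a minimal Python illustration. The function name `n50_l50` and the example lengths are hypothetical and are not taken from the slides.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            # `length` is the N50; `count` contigs were needed, which is the L50.
            return length, count
    raise ValueError("lengths must contain at least one contig")


# Hypothetical assembly: total 1500 bp, so half is 750 bp.
# 500 + 400 = 900 >= 750, reached after two contigs.
print(n50_l50([100, 500, 300, 400, 200]))  # (400, 2)
```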

15 Genome Assembly Assessing Assembly Contiguity
Contig 1 – 1000 bp (remaining contig lengths and total bp not preserved in this transcript)
What is the N50 for this assembly? What is the L50?
N50: 400 bp
L50: 3

16 Genome Assembly Examples of Short-Read Assemblers
ALLPATHS-LG, MaSuRCA, SOAPdenovo, SPAdes, Velvet

17 Genome Assembly PacBio Coming into its Own
72x PacBio coverage (no Illumina data)
Assembly of a 224 Mb plant genome with contig N50 of 2.4 Mb
Image: Pacific Biosciences

18 Exercise Endosymbiotic Bacteria in Whiteflies
Intracellular
Some obligate, some facultative
Various functional roles, including synthesizing nutrients that are lacking in a plant-sap diet
Candidatus Portiera aleyrodidarum is an ancient gammaproteobacterial endosymbiont present in all whiteflies.
Image: Gottlieb et al. 2010

19 Exercise ~/TodosSantos/velvet
Perform Velvet de novo assembly of bacterial genome using Illumina data
Visualize genome assembly with Tablet
Calculate assembly summary statistics with Perl script
Repeat assembly with varying amounts of sequence coverage and different k-mer sizes
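For the last step, one way to vary coverage is to subsample the reads before assembly. The Python sketch below is not part of the workshop materials. It assumes single-end reads of uniform length and uses the usual approximation coverage = (number of reads × read length) / genome size. The function names, genome size, and read length are hypothetical placeholders.

```python
import math
import random


def reads_for_coverage(genome_size, read_length, coverage):
    """Number of reads needed to reach the target average coverage."""
    return math.ceil(coverage * genome_size / read_length)


def subsample_reads(reads, n_reads, seed=42):
    """Randomly pick `n_reads` reads without replacement (all reads if fewer exist)."""
    if n_reads >= len(reads):
        return list(reads)
    return random.Random(seed).sample(reads, n_reads)


# Hypothetical values, for illustration only.
genome_size = 1_000_000  # bp
read_length = 100        # bp
for coverage in (10, 20, 40):
    print(coverage, reads_for_coverage(genome_size, read_length, coverage))
```

Fixing the random seed keeps each subsample reproducible, which makes it easier to attribute differences between assemblies to coverage or k-mer size rather than to a different random draw.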
