Towards your own genome. Designing your Sequencing Run https://genohub.com/next-generation-sequencing-guide/ Sequencing strategy Genome size and genome.


1 Towards your own genome

2 Designing your Sequencing Run https://genohub.com/next-generation-sequencing-guide/ Sequencing strategy Genome size and genome complexity?! related organism, PFGE, flow cytometry

3 Noncoding DNA in genomes

4 Repetitive DNA in the human genome

5 Sequencing strategy Template and Library prep: Fragment (SE), Paired-end (PE) or Mate pair (MP) BAC clones, fosmids.... Sequencing Platform

6 Genome sequencing: Comparison of NGS methods
Single-molecule real-time sequencing (Pacific Bio): read length 2900 bp average [38]; accuracy 87% - 99%; reads per run 35-75 thousand; time per run 30 minutes to 2 hours; cost per 1 million bases $2. Advantages: longest read length; fast; detects 4mC, 5mC, 6mA [41]. Disadvantages: low yield at high accuracy; equipment can be very expensive.
Ion semiconductor (Ion Torrent sequencing): read length 200 bp; accuracy 98%; reads per run up to 5 million; time per run 2 hours; cost per 1 million bases $1. Advantages: less expensive equipment; fast. Disadvantages: homopolymer errors.
Pyrosequencing (454): read length 700 bp; accuracy 99.9%; reads per run 1 million; time per run 24 hours; cost per 1 million bases $10. Advantages: long read size; fast. Disadvantages: runs are expensive; homopolymer errors.
Sequencing by synthesis (Illumina): read length 50 to 250 bp; accuracy 98%; reads per run up to 3 billion; time per run 1 to 10 days, depending upon sequencer; cost per 1 million bases $0.05 to $0.15. Advantages: potential for high sequence yield, depending upon sequencer model and desired application. Disadvantages: short reads.
Sequencing by ligation (SOLiD sequencing): read length 50+35 or 50+50 bp; accuracy 99.9%; reads per run 1.2 to 1.4 billion; time per run 1 to 2 weeks; cost per 1 million bases $0.13. Advantages: low cost per base. Disadvantages: slower than other methods.
Chain termination (Sanger sequencing): read length 400 to 900 bp; reads per run N/A; time per run 20 minutes to 3 hours; cost per 1 million bases $2400. Advantages: long individual reads; useful for many applications. Disadvantages: more expensive and impractical for larger sequencing projects.

7 Application: de novo assemblies
Instrument | BACs, plastids, & microbial genomes | Transcriptome | Plant & animal genome
454 – GS Jr. | B – good but expensive | C – need multiple runs, expensive | D – cost prohibitive
454 – FLX+ | A – good, need to multiplex to be economical | B – good but expensive, libraries usually normalized, not best for short RNAs | C – OK as part of a mixed platform strategy, prohibitive to use alone
MiSeq – v2 | A – good, need to multiplex for best economics | A/B – expensive for rare transcripts (compared to HiSeq), but reads are longer for better assembly | B – expensive relative to HiSeq, but additional read length can be valuable
HiSeq 2000/2500, standard run | B/C – more data than needed unless highly indexed; assembly more challenging than 454 or MiSeq | A – good, assembly more challenging than 454 but much more data available for analyses | A – primary data type in many current projects; requires mate-pair libraries
HiSeq 2500, rapid run (projected) | B – more data than needed unless highly indexed; assembly more challenging than 454 | A – good, assembly more challenging than 454 but much more data available for analyses | A – will probably be more expensive than HiSeq2000, but increased read length may be worth it
Ion Torrent – 314 | B/C – OK, lowest experimental cost but reads are shorter & more expensive than Illumina | C – OK, but reads are shorter & more expensive than Illumina | D – cost prohibitive, reads shorter than alternatives
Ion Torrent – 318 | B/A – good, less data than MiSeq | B/A – good, less data than MiSeq, reads similar to 454 titanium but less expensive | C – high cost relative to Proton or Illumina, more economical than 454 for mixed platform strategy
Ion Torrent Proton I | B – more data than needed unless indexed; assembly more challenging than 454 or Illumina | B/A – assembly currently more challenging than Illumina or 454 | B – expensive relative to HiSeq or Proton II/III
Ion Torrent Proton II (projected) | B/C – more data than needed unless highly indexed; assembly more challenging than 454 or Illumina | B/A – assembly currently more challenging than Illumina or 454 | A/B – should be similar to HiSeq
Ion Torrent Proton III (forecast) | C – more data than needed unless highly indexed | B/A – need assembly pipelines | A – cost per MB could make it the best
SOLiD – 5500 | C – more data than needed unless highly indexed; assembly more challenging than 454 or Illumina | C/D – short reads make assembly challenging or impossible
PacBio – RS | B – good for hybrid assemblies; not economical for solo assemblies – requires high coverage due to high error rates | B/D – good for hybrid assemblies; too expensive for solo use; short RNA is challenging | B/D – good for hybrid assemblies & scaffolding (mixed platform strategy); cost prohibitive for solo use

8 Application: resequencing
Platform – instrument | Targeted loci | Transcript counting | Genome resequencing
454 – GS Jr. | B/C – good but expensive, need to limit loci | D – cost prohibitive | D – cost prohibitive for large genomes
454 – FLX+ | B – good but expensive, should limit loci | D – cost prohibitive | D – cost prohibitive for large genomes
MiSeq | A/B – good, fewer and higher cost reads than HiSeq | B – more expensive than HiSeq or SOLiD or ProtonII+ | B/C – expensive for large genomes
HiSeq 2000/2500 – standard run | A – primary data type in many current projects; best for many loci | A – primary data type in many current projects
HiSeq 2500 – rapid run (projected) | A – faster path to leading data type | A/B – likely to be slightly more expensive than with standard flow cell | A – faster path to leading data type
Ion Torrent – 314 | C – OK but expensive, need to limit loci | D – cost prohibitive
Ion Torrent – 318 | B – good, slightly less data per run than MiSeq | B/C – more expensive than HiSeq or SOLiD; new informatics pipelines needed; new error profile | C – expensive for large genomes
Ion Torrent Proton I | A/B – similar to MiSeq, but different error profile will inhibit switching | B – more expensive than Illumina or SOLiD; new informatics pipelines needed (different error profile than Illumina) | B – expensive relative to HiSeq or Proton II+
Ion Torrent Proton II (projected) | A/B – similar to HiSeq, but different error profile will inhibit switching | A/B – new informatics pipelines needed | A – supposed to set new pricing standard, could become leading shorter-read platform
Ion Torrent Proton III (forecast) | A/B – costs projected to be better than HiSeq; error profile different than Illumina | A/B – new informatics pipelines needed | A – supposed to set new pricing standard, could become leading shorter-read platform
SOLiD – 5500xl | B – harder to assemble than Illumina | A/B – used much less than HiSeq
PacBio – RS | C/D – expensive but can sequence difficult regions | D – cost prohibitive | C/D – cost prohibitive except for structural variants

9 Bacterial genomes

10 Noncoding DNA in genomes

11 Bacterial genomes

12

13

14 Complex Bacterial Genomes Fosmid and plasmid library; Sanger

15 Simplified Bacterial Genomes MDA for 16h on one lysed cell 3kb Sanger libraries plus 454 15 gaps (chimeric clones) Sanger finishing Polishing by Illumina reads 37 regions Sanger polishing 454 (average read length 225bp) Illumina (33bp)

16 Bacterial genomes

17 Eukaryotic Genomes

18 Eukaryotic Genomes: Fish genomes Template: A female fish was chosen because of its XX sex chromosome constitution Roche 454 Titanium (3 and 20kb libraries) Illumina PE insert size 200bp and 75 bp reads physical map: fingerprints with ABI3730 from the WLC-1247 BAC library (insert size of 160 kb; 10× genome coverage with a total of 43,192 clones available)

19 Bird genomes

20 Mammalian genomes HiSeq2000 DNA isolated from blood

21 Extremely large genomes loblolly pine (Pinus taeda) The largest genome assembled to date DNA template: a single megagametophyte, the haploid tissue of a single pine seed – quantity long-fragment mate pair libraries from the parental diploid DNA Novel fosmid DiTag libraries N50 scaffold size of 66.9 kbp

22 Raw Data Trimming and Filtering Quality score
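
A minimal sketch of quality-based trimming and filtering, assuming FASTQ-style quality strings in Phred+33 encoding. The thresholds (`min_q=20`, `min_len=30`) and all function names are illustrative assumptions, not values from the slides.

```python
def phred_to_error_prob(q):
    """Convert a Phred score Q to an error probability, P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def decode_quality(qual_string, offset=33):
    """Decode an ASCII quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in qual_string]


def trim_3prime(seq, qual_string, min_q=20, offset=33):
    """Drop bases from the 3' end until a base with quality >= min_q is reached."""
    quals = decode_quality(qual_string, offset)
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], qual_string[:end]


def passes_filter(qual_string, min_mean_q=20, min_len=30, offset=33):
    """Keep a read only if it is long enough and its mean quality is high enough."""
    quals = decode_quality(qual_string, offset)
    return len(quals) >= min_len and sum(quals) / len(quals) >= min_mean_q
```

A typical use would be to call `trim_3prime` on each read first and then apply `passes_filter` to the trimmed quality string.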

23 Raw Data Trimming and Filtering

24

25 Assembly N50 N75 Contigs Scaffolds
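
N50 and N75 can be computed from a list of contig (or scaffold) lengths. The sketch below uses the common definition: the length L such that contigs of length L or longer contain at least x% of the assembled bases. The function name `nx` is an assumption made for illustration.

```python
def nx(lengths, x=50):
    """Return the Nx statistic (N50 by default) for a list of sequence lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until x% of all bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 100 >= total * x:
            return length


contig_lengths = [80, 70, 50, 40, 30, 20]
n50 = nx(contig_lengths, 50)
n75 = nx(contig_lengths, 75)
```

The same function applies to scaffolds; only the input list changes.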

26 Assembly: K-mer A common sequence shared by pairs of reads
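
As a small illustration of the k-mer idea, the sketch below lists the k-mers of a read, counts them across a read set, and finds the k-mers shared by a pair of reads. Function names and the example reads are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Return all overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_kmers(reads, k):
    """Count how often each k-mer occurs across a collection of reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def shared_kmers(read_a, read_b, k):
    """Return the set of k-mers present in both reads."""
    return set(kmers(read_a, k)) & set(kmers(read_b, k))
```

Two reads that share a k-mer are candidates for overlapping; reads with no shared k-mer can be skipped without a full alignment.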

27 Assembly: K-mer

28 Assembly

29 Assembly – algorithms Repeats! OLC Overlap/Layout/Consensus Overlap: Overlap discovery all-against-all, seed & extend heuristic algorithm; K-mers as alignment seeds-sensitivity Layout: Construction and manipulation of an overlap graph leads to an approximate read layout Consensus: Multiple sequence alignment (MSA) determines the precise layout and then the consensus sequence. Loading base calls-computer memory
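
The overlap and layout steps can be sketched on a toy scale. The code below is a simplified greedy illustration, not the algorithm of any particular assembler: it assumes error-free reads on one strand, so the consensus step reduces to simple concatenation, and it requires `min_len >= k` so that the k-mer seed never hides a valid overlap. All names are hypothetical.

```python
def suffix_prefix_overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def find_overlaps(reads, k, min_len):
    """All-against-all overlap discovery, using b's first k-mer as a cheap seed."""
    overlaps = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j and b[:k] in a:  # seed hit: worth checking in detail
                n = suffix_prefix_overlap(a, b, min_len)
                if n:
                    overlaps[(i, j)] = n
    return overlaps


def greedy_assemble(reads, k=3, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        overlaps = find_overlaps(reads, k, min_len)
        if not overlaps:
            break
        (i, j), n = max(overlaps.items(), key=lambda item: item[1])
        merged = reads[i] + reads[j][n:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)] + [merged]
    return reads


contigs = greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTTA"])
```

A greedy merge like this has no way to tell repeat copies apart, which is one reason repeats are flagged as the main difficulty on this slide.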

30 Assembly vs Repetitive DNA

31

32 Assembly vs Repetitive DNA and Coverage Why is coverage important? resolution repeat discovery, copy number estimation binning of metagenomic data
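
Average coverage is usually estimated as C = N × L / G (number of reads × read length / genome size). The sketch below applies that formula and, under the idealized Lander-Waterman assumption of uniformly random read placement, estimates the fraction of bases left uncovered as e^(-C). The example numbers are illustrative, not taken from the slides.

```python
import math


def coverage(num_reads, read_length, genome_size):
    """Average depth of coverage: C = N * L / G."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Number of reads required to reach a target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


def expected_uncovered_fraction(cov):
    """Expected fraction of bases with zero reads, assuming random placement."""
    return math.exp(-cov)


# Hypothetical 5 Mbp bacterial genome
depth = coverage(1_000_000, 100, 5_000_000)
n_reads = reads_needed(30, 250, 5_000_000)
```

Real data deviate from the random-placement model (for example through GC bias), so these figures are better read as lower bounds on the sequencing effort.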

33 Why is GC important? affecting coverage HGT discovery binning of metagenomic data Assembly vs GC content both GC-rich fragments and AT-rich fragments are underrepresented in the Illumina sequencing results
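
GC content is simple to compute per sequence or per window, which is how it is typically compared against coverage. A minimal sketch, with hypothetical function names; ambiguous bases such as N are excluded from the denominator.

```python
def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def gc_windows(seq, window, step=None):
    """GC fraction in consecutive windows; returns (start, gc) pairs."""
    step = step or window
    return [
        (start, gc_fraction(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]
```

Plotting per-window GC against per-window read depth is one way to see the under-representation of GC-rich and AT-rich fragments mentioned above.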

34 Assembly vs GC content Less even coverage with Illumina

35 Velvet and Velvet Optimizer Newbler Celera MaSuRCA Assembling algorithms and Scaffolders http://en.wikipedia.org/wiki/Sequence_assembly

36 Assembling algorithms and Scaffolders

37 Annotation

38 Ready for Annotation? Checking gene coverage: UCOs - Ultra Conserved Orthologs (Kozik et al., 2007); CEGMA - Core Eukaryotic Genes Mapping Approach (Parra et al., 2007); SICO genes - Single Copy genes, Proteobacteria (Lerat et al., 2003). Median gene length roughly proportional to genome size. Percent gaps: library insert size vs. 50 “N”s
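
The percent-gaps check can be sketched as follows, assuming gaps in scaffolds are written as runs of N. Listing the run lengths shows whether gaps were sized from the library insert size or filled with a fixed placeholder such as 50 Ns. Function names and the example scaffold are hypothetical.

```python
import re


def percent_gaps(scaffold):
    """Percentage of a scaffold made up of N (gap) characters."""
    if not scaffold:
        return 0.0
    return 100 * scaffold.upper().count("N") / len(scaffold)


def gap_lengths(scaffold):
    """Lengths of all consecutive runs of N, in order of appearance."""
    return [len(m.group()) for m in re.finditer(r"N+", scaffold.upper())]


example = "ACGT" + "N" * 50 + "ACGTAC" + "N" * 40
```

If every value returned by `gap_lengths` equals 50, the gaps are probably placeholders rather than size estimates.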

39 Sanger 454 Illumina Ready for Annotation? UCOs

40 Annotation of Prokaryotic Genomes Automated pipelines and annotation software: RAST, BASys, SOP, PROKKA, IMG ER Gene prediction: GLIMMER, Prodigal (Prokaryotic Dynamic Programming Genefinding Algorithm)

41 Annotation of Prokaryotic Genomes Repeated errors Inconsistent gene names Additional data and postgenomic experiments

42 Annotation of Eukaryotic Genomes Standard draft assembly High quality draft assembly Two phases 1. computation phase repeat masking (homopolymers, transposable elements) evidence alignment (proteins, ESTs, RNA-seq data aligned) ab initio gene prediction vs Evidence driven gene prediction 2. annotation phase finding a consensus

43 Annotation of Eukaryotic Genomes Gene prediction and gene annotation are not synonyms! Predictors do not report untranslated regions (UTRs) or splice variants

44 Annotation of Eukaryotic Genomes

