National Center for Emerging and Zoonotic Infectious Diseases Division of Scientific Resources Sequencing and Bioinformatics in the CDC Biotechnology Core.


1 National Center for Emerging and Zoonotic Infectious Diseases, Division of Scientific Resources. Sequencing and Bioinformatics in the CDC Biotechnology Core Facility Branch. Computational Lab: Scott Sammons (Team Lead), Kevin Tang, Kristen Knipe. Sequencing Lab: Mike Frace (Team Lead), Lori Rowe, Marina Khristova, Mark Burroughs, Milli Sheth.

2 Genome Sequencing Lab sequencing platforms: PacBio SMRT sequencer, Ion Torrent PGM, Illumina MiSeq, Roche 454 Titanium +, Illumina GA IIx, Illumina 2500

3 Building 23 Server Room – Main Aisle

4 High Performance Computing Cluster (Aspen). What is it? 35 compute nodes, each with 12 processor cores (420 cores in total), 110GB of memory, and 2 Tesla 2050 GPU cards. What can it do today? 40 cluster applications are currently enabled, including MatLab, Beast, MrBayes, Blast, MPI Blast, PacBio analysis tools, Celera Assembler, CLC Server, and Geneious Server.

5 Isilon. What is it? High-speed, scalable, and redundant Network Attached Storage. Connected to both the CDC network and the Aspen HPC cluster using InfiniBand. Total of 500TB usable space. What can it do today? Provides workspace for end-users and HPC applications. Solves the problem of running out of disk space on individual servers. What are we doing with it? Data warehouse for all scientific equipment. Central network share for all scientific users. Integrating directly with ITSO’s Active Directory forest.

6 Private Cloud. What is it? Supports science through front-end and back-end services. Implementation of virtualized infrastructure. Currently being deployed. What can it do today? Provides test environments for scientific projects. Lays the foundation for hardware consolidation and migration. What are we doing with it? Standardizing platforms. Centralizing management.

7 Sequencing Lab Origins. Began in 2001. Mission: sequence 8 human smallpox viruses before the WHO revisits destruction of all smallpox stocks. By 2005, had sequenced over 150 smallpox and related poxvirus genomes. 2006: Roche 454, focus moved to small bacterial genomes. 2010: Illumina GAIIx. 2011: Ion Torrent, PacBio.

8 Sequencing: extended PCR. [Figure: position of E-PCR overlapping amplicons A1 to A18 on the HindIII map (End-R DPOCEKHMLIFNAJBGSR Q End-L).] Primers designed using VAR-BSH and VAC-CPN sequences. Primers target genes involved in reproduction & host response. Sequence sample: primers 40 sites, 1 enz. RFLP ~120 sites. PCR uses minimal DNA amounts, often no need to grow virus. PCR uses hifi expand long-template Taq & Pwo enzymes (Roche).

9 Sequencing Assembly: Phred/Phrap/Consed

10 Gene Prediction. Heuristic algorithm to assign quality scores to ORFs (from 1 to 100). Quality scores are based on a number of factors, including: –Gene predictions (glimmer, genemark, getorf) –Primary sequence homology to known genes (BLAST) –Presence of predicted promoter (MEME/MAST) –Size of predicted ORF –Presence of transcription termination signals
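
One way to picture the heuristic on this slide is a weighted sum of the listed evidence types, clamped to the 1 to 100 range. The sketch below is only an illustration: the `OrfEvidence` fields, the weights, and the 900 nt length cap are all hypothetical and do not come from the presentation.

```python
from dataclasses import dataclass


@dataclass
class OrfEvidence:
    """Evidence for one ORF. Field names are hypothetical."""
    predictors_agreeing: int  # how many of glimmer/genemark/getorf called it (0-3)
    has_blast_hit: bool       # homology to a known gene
    has_promoter: bool        # predicted promoter (e.g. from MEME/MAST)
    length_nt: int            # size of the predicted ORF
    has_terminator: bool      # transcription termination signal present


def orf_quality(e: OrfEvidence) -> int:
    """Combine the evidence into a score from 1 to 100.

    The weights (40/25/15/10/10) and the length cap are assumptions
    chosen for illustration, not the values used at CDC.
    """
    score = 0.0
    score += 40.0 * min(max(e.predictors_agreeing, 0), 3) / 3
    score += 25.0 if e.has_blast_hit else 0.0
    score += 15.0 if e.has_promoter else 0.0
    score += 10.0 * min(max(e.length_nt, 0), 900) / 900
    score += 10.0 if e.has_terminator else 0.0
    return max(1, min(100, round(score)))


if __name__ == "__main__":
    strong = OrfEvidence(3, True, True, 1200, True)
    weak = OrfEvidence(1, False, False, 150, False)
    print(orf_quality(strong), orf_quality(weak))
```

A weighted sum keeps each evidence source independent, which makes it easy to retune a single weight when one predictor turns out to be unreliable for a given genome.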

11 Visualizing Gene Predictions and Differences

12 ITR crm-D ORFs of CPVXs from 4 different clades

13 Smallpox strains (45) by clade and case fatality rate (CFR): A. West African int., CFR ~10%; B. American alastrim minor, CFR <1%; C. Asian major, CFR ~5-35%; C-1. non-West-African African int., CFR ~10%; C-2. non-West-African African minor, CFR <1%

14 Unrooted tree: phylogenetic relationships of the ORF encoding the hemagglutinin protein

15 GSL sequencing 2013: NCEZID, NCIRD, NCHHSTP, Vibrio cholerae, Vibrio spp., Campylobacter, Salmonella, Bacillus anthracis, Listeria, Burkholderia spp., Yersinia pestis, Brucella spp., Klebsiella pneumoniae, Fungal Meningitis, Rift Valley Fever virus, Lujo virus, Marburg virus, CCHF virus, Lassa Fever virus, Clinical sample metagenomics, Haemophilus influenzae, Legionella pneumophila, Legionella spp., Mycoplasma pneumoniae, Water cooling tower metagenomics, Respiratory filter metagenomics, Bordetella spp., Tick metagenomics, Neisseria spp., Hepatitis, Mycobacterium tuberculosis, INFLUENZA, CGH, Rhodococcus, Cryptosporidium, Fasciola spp., Balamuthia spp.

16 Next-Gen Diagnostic Sequencing Applications. ‘Massively parallel’ sequencing not only produces high throughput, it provides sequences of potentially millions of individual molecules (instant cloning). Sequencing a PCR reaction allows a detailed search for low-expression quasi-species or mutations which may signal growing drug or vaccine resistance, a process called ultra-deep or amplicon sequencing. Example: clinical case of poxvirus infection with samples exhibiting a reduced sensitivity to an antiviral drug. Complex clinical, laboratory or environmental samples can be sequenced to provide a diagnostic ‘snapshot’ of the resident organisms, an approach called metagenomic sequencing. Examples: tissue culture, soil, blood serum, sputum, stool. Shotgun / Paired-End Sequencing: random shearing of DNA, even sequence coverage over entire genome.

17 Shotgun / Paired-End Sequencing. De novo Assembly: Newbler, CLCBio, Mira, Geneious, Velvet, Celera Assembler. Reference Mapping: Newbler, CLCBio, Mira, Geneious, BWA, Bowtie.

18 Genome Assembly Visualization

19

20 Genome Comparison

21 HGAP – Hierarchical Genome Assembly Process. PreAssembly –Generation of long accurate reads. Assembly –Choice of assemblers, but OLC (Overlap Layout Consensus) assemblers such as MIRA and Celera Assembler work best. Consensus Polishing –Quiver, a quality-aware consensus algorithm, maps all reads back to the assembly and creates a new consensus.
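
The "overlap" step of an Overlap Layout Consensus assembler can be sketched as finding the longest suffix of one read that matches a prefix of another. The function below is a toy illustration with exact matching only. Real OLC assemblers such as MIRA and Celera Assembler tolerate sequencing errors and use indexing to avoid all-against-all comparison, and the function name and `min_len` default are assumptions.

```python
def overlap_length(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when no exact overlap of at least `min_len` bases exists.
    """
    longest_possible = min(len(a), len(b))
    for n in range(longest_possible, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


if __name__ == "__main__":
    read_1 = "ACGTTGCA"
    read_2 = "TGCAGGT"
    n = overlap_length(read_1, read_2)
    # Merge the two reads across the shared bases.
    print(n, read_1 + read_2[n:])
```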

22 HGAP: PreAssembly 30X

23 HGAP: PreAssembly/Assembly. Correct seed reads with short reads. Assemble with Celera Assembler or MIRA.

24 HGAP - Quiver To reduce the remaining InDel and base substitution errors in the draft assembly, we use the PacBio Quiver, a quality-aware consensus algorithm. Four different per-base Quality Values (QV scores) represent the intrinsically calculated error probabilities for inserted, deleted, substituted and merged base calls in single pass reads. These values allow Quiver to generate a highly accurate consensus for the final assembly, which frequently exceeds QV50 (99.999% accuracy).
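
The QV50 figure follows from the standard Phred-scale relation QV = -10 · log10(P_error): a QV of 50 corresponds to an error probability of 1e-5, that is 99.999% accuracy. A minimal sketch of the conversion (the function names are illustrative):

```python
import math


def qv_to_accuracy(qv: float) -> float:
    """Convert a Phred-scaled quality value to per-base accuracy."""
    return 1.0 - 10.0 ** (-qv / 10.0)


def error_to_qv(p_error: float) -> float:
    """Convert a per-base error probability to a Phred-scaled quality value."""
    return -10.0 * math.log10(p_error)


if __name__ == "__main__":
    for qv in (20, 30, 40, 50):
        print(f"QV{qv}: accuracy {qv_to_accuracy(qv):.5%}")
```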

25 HGAP Example

26 HGAP Confirmation with Physical Mapping

27 HGAP Assembly Structural Confirmation

28 HGAP Sequence Confirmation with Illumina reads

29 Amplicon (deep) sequencing project. Clinical case of progressive vaccinia infection from smallpox vaccination of an immunocompromised patient. Pox antiviral ST-246 administered, which targets pox gene F13L, a major envelope protein which mediates production of extracellular virus. Oral ST-246 given daily and vaccination site sampled over a 3-week period. Li, Damon - NCEZID/DVRD/PRB

30 A region of gene F13L was amplified from clinical samples, deep sequenced, and compared to the smallpox vaccine reference sequence (Acambis 2000). Control swab prior to ST-246.

31 2 weeks after ST-246: C > T 869; T > A 943

32 3 weeks after ST-246: C > T 869; T > A 943
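
In amplicon deep sequencing, the emergence of a mutation such as C > T is tracked as the fraction of reads carrying the alternate base at one position, compared across time points. The sketch below assumes the reads have already been aligned and that the bases covering the position have been collected into a string. The example pileups are made up for illustration and are not the data from this case.

```python
from collections import Counter


def variant_frequency(pileup_bases: str, alt_base: str) -> float:
    """Fraction of reads showing `alt_base` at a single aligned position.

    `pileup_bases` holds one character per read covering the position;
    anything other than A, C, G or T (e.g. N or gaps) is ignored.
    """
    counts = Counter(b for b in pileup_bases.upper() if b in "ACGT")
    depth = sum(counts.values())
    if depth == 0:
        return 0.0
    return counts[alt_base.upper()] / depth


if __name__ == "__main__":
    # Hypothetical pileups at one position for three time points.
    timepoints = {
        "control": "CCCCCCCCCC",
        "2 weeks": "CCCCCCTTTT",
        "3 weeks": "CCTTTTTTTT",
    }
    for label, bases in timepoints.items():
        print(label, variant_frequency(bases, "T"))
```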

33 What is Metagenomics? The genomic study of DNA from uncultured microorganisms, generally from environmental samples. Related: metatranscriptomics, metaproteomics.

34 Sample Coverage: Rarefaction Curves. [Figure: rarefaction curves; axis label: Samples.] Source: Wooley JC, Godzik A, Friedberg I (2010) A Primer on Metagenomics. PLoS Comput Biol 6(2).
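
A rarefaction curve plots how many distinct taxa have been seen as more reads are sampled; a curve that flattens suggests the sample has been sequenced deeply enough. The sketch below is a simplified version that walks through one random ordering of the reads. Averaging over several seeds would give a smoother curve. The function name and its `step` and `seed` parameters are assumptions.

```python
import random


def rarefaction_curve(taxon_labels, step, seed=0):
    """Return (reads_sampled, distinct_taxa_seen) points for one random ordering.

    `taxon_labels` has one taxon assignment per read. A point is recorded
    every `step` reads and once more at the end.
    """
    rng = random.Random(seed)
    shuffled = list(taxon_labels)
    rng.shuffle(shuffled)

    seen = set()
    curve = []
    for i, label in enumerate(shuffled, start=1):
        seen.add(label)
        if i % step == 0 or i == len(shuffled):
            curve.append((i, len(seen)))
    return curve


if __name__ == "__main__":
    # Made-up community: one dominant taxon and a few rare ones.
    reads = ["taxon_A"] * 80 + ["taxon_B"] * 15 + ["taxon_C"] * 4 + ["taxon_D"]
    for depth, taxa in rarefaction_curve(reads, step=20):
        print(depth, taxa)
```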

35 Classification Techniques. Supervised Taxonomic Classification. Homology-based: database searching by similarity (BLAST, SW); BLAST, BLASTX against GenBank and specialized DBs (NCBI-ENV-NT, NCBI-ENV-NR). Composition-based: N-mer frequency; Markov Models, Support Vector Machines (SVM); need training set. Unsupervised Taxonomic Classification. Clustering methods: SOM (self-organizing maps), PCA (principal component analysis).
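
Composition-based classifiers start by turning each sequence into a fixed-length vector of N-mer frequencies, which can then be fed to a trained model (for example an SVM) or to a clustering method. A minimal sketch of that feature extraction step, assuming k = 4 as a default and skipping any N-mer containing an ambiguous base:

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq: str, k: int = 4) -> list:
    """Return the normalized k-mer frequency vector of `seq`.

    The vector has 4**k entries in lexicographic order (AAAA, AAAC, ...).
    k-mers containing characters other than A, C, G, T are not counted.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


if __name__ == "__main__":
    vec = kmer_frequencies("ACGTACGTNNACGT", k=2)
    print(len(vec), round(sum(vec), 6))
```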

36 Viral Metagenomic Pipeline (Wash U scripts implemented at CDC): Sample Collection → DNA Library Construction → Sequencing → Basecalling → Vector Trimming → Assembly → Contigs, Reads → Remove redundant sequences → Unique sequences → Mask repetitive and low complexity seqs → Good sequences → BLASTN against Human Genome (e ≤ 1e-10) → Non-human sequences → BLASTN vs nt, BLASTX vs nr, BLASTN vs GB-viral → Report Generation, Display in MEGAN, inspect top hits
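
Two of the filtering steps in this pipeline, removing redundant sequences and discarding reads that hit the human genome at e ≤ 1e-10, can be sketched as follows. This is not the Wash U code. It assumes BLAST results in the standard 12-column tabular format (query ID in column 1, e-value in column 11), and the function names are hypothetical.

```python
def unique_sequences(records: dict) -> dict:
    """Keep the first read ID for each distinct sequence (case-insensitive)."""
    seen = set()
    kept = {}
    for read_id, seq in records.items():
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            kept[read_id] = seq
    return kept


def human_hit_ids(blast_tab_lines, max_evalue: float = 1e-10) -> set:
    """Collect query IDs with at least one hit at or below `max_evalue`.

    Expects tab-separated lines with the e-value in the 11th column.
    """
    ids = set()
    for line in blast_tab_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if float(fields[10]) <= max_evalue:
            ids.add(fields[0])
    return ids


def non_human_sequences(records: dict, blast_tab_lines) -> dict:
    """Deduplicate, then drop every read that matched the human genome."""
    human = human_hit_ids(blast_tab_lines)
    return {rid: seq for rid, seq in unique_sequences(records).items()
            if rid not in human}


if __name__ == "__main__":
    reads = {"r1": "ACGT", "r2": "acgt", "r3": "GGGG", "r4": "TTTT"}
    hits = [
        "r3\tchr1\t99.0\t4\t0\t0\t1\t4\t100\t103\t1e-20\t50",
        "r4\tchr2\t90.0\t4\t0\t0\t1\t4\t5\t8\t0.5\t12",
    ]
    print(non_human_sequences(reads, hits))
```

The surviving reads would then go on to the broader database searches listed on the slide.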

37 Megan

38

39 Ugandan Outbreak Samples. 4 patients. Total RNA from patient sera. 2 samples per 454 run. ~565,000 reads/sample, avg length = 235nt. Sequences were screened for random library amplification primers and low quality. Assembled each run de novo using the 454 gsAssembler. Performed a blastx database search using the assembled contigs (overnight). Visualized the blast output using MEGAN.

40 MEGAN (MetaGenomeANalyzer)

41 Ugandan Outbreak - results. Run 1: 5 contigs (out of 2463 > 100nt) matched YF virus, covering 98% of the genome (10,441 of 10,823bp). Mapped each sample from Run 1 using an Ethiopian YF virus as reference; individual reads from Sample 1 identified as YF. Run 2: no YF reads found.

42 Phylogenetic analysis of yellow fever virus sequences Laura McMullan (DHPP/VSPB)

43 Comparative Metagenomics. One 454 run, two samples. Sample 1: ~578,000 reads, avg read length 438 bases. Sample 2: ~550,000 reads, avg read length 425 bases. Total number of bases sequenced: ~488,000,000.
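
The total on this slide can be roughly reproduced from the per-sample figures as reads × average read length. Because the read counts are approximate, the product lands near the quoted total rather than exactly on it. The function name below is illustrative.

```python
def total_bases(samples: dict) -> int:
    """Sum reads * average read length over all samples.

    `samples` maps a sample name to (read_count, average_read_length).
    """
    return sum(reads * avg_len for reads, avg_len in samples.values())


if __name__ == "__main__":
    run = {
        "Sample 1": (578_000, 438),
        "Sample 2": (550_000, 425),
    }
    print(f"{total_bases(run):,} bases")
```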

44 Sample 1 – Rarefaction Curve

45 Sample 1 Taxa tree (collapsed at the Order level)

46 Comparison of Sample 1 and 2

47 Bioinformatics Tools. Bioinformatics Packages: EMBOSS, CLCbio, Geneious, LaserGene-Ngen, Galaxy. General Tools/Languages: Java/BioJava, Perl/BioPerl, R, BLAST Suite, BioEdit. Genome Comparison/Alignment Tools: Mavid, Mauve, Clustal, Muscle, MAFFT. Gene Prediction: Glimmer, GeneMark. Assembly/Mapping Tools: 454 Suite, Mosaik Tools, Mummer, BWA, Velvet, AHA (pacbio). Functional Annotation: Manatee. Phylogenetics: Paup, Phylip, MrBayes, Beauti/Beast, MEGA, DnaSP. Metagenomics: MEGAN, Galaxy, Carma. In-House: WAMS, POCs/VOCs.

48 Challenges. Data Management – image files are large, and moving these files around the network is slow. Assembly/Mapping Software – some is provided with the instrument, but additional methods and algorithms are needed. Finishing Tools – gap filling, primer design. Visualization Tools – tools to graphically display contigs on a reference sequence as well as multiple genome alignments. Generic Robust Annotation Tools – researchers need tools to intelligently choose predicted ORFs as genes, assign function, and submit to GenBank.

49

50 What are the weaknesses of current next-gen sequencers? Complicated and time-consuming library preparation; requires amplification of library; instruments require repetitive sequential ‘flows’ of reagents. Requires micrograms of DNA to begin; 3 days to prepare library. Low copy number polymorphisms may be missed; emulsion PCR is an inefficient, time-consuming, oily mess; potential to introduce PCR bias into sample. Repetitive flows of nucleotides, blocking/unblocking chemistry, and washing out reaction byproducts all slow synthesis and hinder read length; consumes liters of reagents ($); repetitive flows and imaging extend sequence runs to days (or weeks).

51 Pacific Biosciences SMRT sequencer (single-molecule sequencer); Ion Torrent Personal Gene Machine (solid-state sequencer); Nanopore sequencing

52 Pacific Biosciences SMRT sequencer. Sponsor: Influenza Research Agenda

53 Pacific Biosciences SMRT Technology. [Figure: individual ZMW with attached polymerase and DNA strand; laser excitation/detection volume; glass; ~50 nm.] Functional volume (red) is in zL! SMRTcell = 160,000 ZMW; SMRTcell array = 1.5 million ZMW

54 Nucleotide incorporation is a real-time data movie. [Figure scale: 100 ms]

55 Pacific Biosciences Advantages. Read lengths of 1,000 – 10,000 bases. No reagent ‘flows’ = 10-fold increase in sequencing speed. Substitute reverse transcriptase for polymerase and sequence RNA directly. Bacterial genomes sequenced in hours. Sequence run costs $99 and takes 15 minutes to complete.

