Bioinformatics in the CDC Biotechnology Core Facility Branch


1 Bioinformatics in the CDC Biotechnology Core Facility Branch
Computational Lab: Scott Sammons, Kevin Tang, Chandni Desai
Sequencing Lab: Mike Frace, Missy Olsen-Rasmussen, Marina Khristova, Lori Rowe

2 sequencing platforms – current and upcoming
Genome Sequencing Lab
AB 3730XL
Roche 454 Titanium +
Illumina GA IIx
Pacific Biosciences SMRT sequencer
Ion Torrent Personal Genome Machine

3 Building 23 Server Room – Main ISLE
This slide is here only to show the quantity. It isn’t expected that these will all be talked about. Also, comment on how some are large longer-term and some are smaller shorter-term.

4 High Performance Computing Cluster (Aspen)
What is it?
  35 compute nodes, each with 12 processor cores, 48 GB of memory, and 2 Tesla 2050 GPU cards
  Currently in the final stages of development in preparation for code freeze and C&A
What can it do today?
  25 cluster applications are currently enabled for our phase-one deployment, including MATLAB, Geneious, BEAST, BLAST, and PacBio
  Collaboration with NCI via IAA will take GPU scientific applications even further
How fast is it?
  For example, a BLAST job that takes over 60 hours to complete on our old cluster takes 2 hours on the new cluster*
  *Not GPU-optimized code
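
The per-node figures and the BLAST timing above can be turned into cluster-wide totals and a speedup factor. The following is a minimal sketch of that arithmetic only; the function names are illustrative, and the speedup is a lower bound because the slide says the old job took "over" 60 hours.

```python
# Figures come from the slide; function and constant names are illustrative.
NODES = 35
CORES_PER_NODE = 12
MEMORY_GB_PER_NODE = 48
GPUS_PER_NODE = 2


def cluster_totals(nodes, cores_per_node, memory_gb_per_node, gpus_per_node):
    """Aggregate per-node resources across the whole cluster."""
    return {
        "cores": nodes * cores_per_node,
        "memory_gb": nodes * memory_gb_per_node,
        "gpus": nodes * gpus_per_node,
    }


def speedup(old_hours, new_hours):
    """Ratio of old wall-clock time to new wall-clock time."""
    return old_hours / new_hours


if __name__ == "__main__":
    totals = cluster_totals(NODES, CORES_PER_NODE, MEMORY_GB_PER_NODE, GPUS_PER_NODE)
    print(totals)          # 420 cores, 1680 GB of memory, 70 GPU cards
    print(speedup(60, 2))  # at least 30x for the BLAST job quoted above
```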

5 Isilon
What is it?
  High-speed, scalable, and redundant network-attached storage
  Currently in the process of being integrated with applications
  Connected to both the CDC network and the Aspen HPC cluster using InfiniBand
What can it do today?
  Provides user workspace for end users and HPC applications
  Solves the problem of running out of disk space on individual servers
What are we doing with it?
  Data warehouse for all scientific equipment
  Central network share for all scientific users
  Integrating directly with ITSO’s Active Directory forest

6 Private Cloud
What is it?
  Support for science through front-end and back-end services
  Implementation of virtualized infrastructure
  Currently in the process of being deployed
What can it do today?
  Provide test environments for scientific projects
  Lay the foundation for hardware consolidation and migration
What are we doing with it?
  Standardize platforms
  Centralize management
  Support ongoing growth within the scientific computing community while enabling science

7 Scientific Computing Infrastructure: The Server Room
2 Linux High Performance Computing Clusters (~40 nodes each)
1 Genomics Cluster
4 Solaris Servers
12 Stand-Alone Linux Servers
1 Stand-Alone Database Server
5 Stand-Alone Windows Servers
Virtualized Cluster with 15 VMs
3 NAS Devices
2 Tape Libraries
2 Dedicated IP Subnets
One C&A addressing all legacy production hardware (NCEZID), with several in process for systems currently under development (NCIRD)

8 GSL sequencing 2011
Column headings: Influenza, NCIRD, NCEZID, CGH
Haemophilus influenzae
Legionella pneumophila
Legionella spp.
Mycoplasma pneumoniae
Water cooling tower metagenomics
Respiratory filter metagenomics
Bat metagenomics
Vibrio cholerae
Vibrio spp.
Cyclospora
Bacillus anthracis
Listeria
Yersinia pestis
Brucella spp.
Klebsiella pneumoniae
Junin virus
Rift Valley Fever virus
Lujo virus
Marburg virus
CCHF virus
Lassa Fever virus
Clinical sample metagenomics
Tick metagenomics
Soil metagenomics
Guinea worm
Taenia solium
Angiostrongylus

9 Sequencing: extended PCR
[Figure: position of E-PCR overlapping amplicons (A1 to A18, End-L, End-R) on the HindIII map; fragment order D P O C E R K H M L I F N A S J B G Q]
Primers designed using VAR-BSH and VAC-CPN sequences
Primers target genes involved in reproduction & host response
Sequence sample: primers 40 sites, 1 enz. RFLP ~120 sites
PCR uses minimal DNA amounts, often no need to grow virus
PCR uses high-fidelity Expand Long Template Taq & Pwo enzymes (Roche)

10 First Pass Assembly: Seqmerge
[Figure: first-pass assembly coverage; axis: fold redundancy (4, 8, 12, 16)]

11 Sequencing Assembly: Phred/Phrap/Consed
Using the phred base caller, phrap assembler, and consed viewer, we are able to troubleshoot our sequencing projects and make intelligent calls in areas of ambiguity.

12 Gene Prediction
Heuristic algorithm to assign quality scores to ORFs (from 1 to 100)
Quality scores are based on a number of factors, including:
  Gene predictions (Glimmer, GeneMark, getorf)
  Primary sequence homology to known genes (BLAST)
  Presence of predicted promoter (MEME/MAST)
  Size of predicted ORF
  Presence of transcription termination signals
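
The slide lists the evidence types that feed the ORF quality score but not how they are combined. The sketch below shows one way such a heuristic could be assembled. Every weight, the 300 nt length cap, and the names `OrfEvidence` and `orf_quality` are hypothetical choices made for illustration, not the scoring scheme used in the lab.

```python
from dataclasses import dataclass


@dataclass
class OrfEvidence:
    """Evidence collected for one candidate ORF (field names are hypothetical)."""
    predictors_agreeing: int   # how many of the 3 gene finders called this ORF
    has_blast_homolog: bool    # primary sequence homology to a known gene
    has_promoter: bool         # predicted promoter upstream
    length_nt: int             # size of the predicted ORF
    has_terminator: bool       # transcription termination signal present


def orf_quality(evidence: OrfEvidence) -> int:
    """Combine the evidence into a 1-100 score using made-up weights."""
    score = 0.0
    score += 30 * min(evidence.predictors_agreeing, 3) / 3
    score += 30 if evidence.has_blast_homolog else 0
    score += 15 if evidence.has_promoter else 0
    score += 15 * min(evidence.length_nt, 300) / 300  # saturates at 300 nt
    score += 10 if evidence.has_terminator else 0
    # Clamp to the 1-100 range quoted on the slide.
    return max(1, min(100, round(score)))


if __name__ == "__main__":
    strong = OrfEvidence(3, True, True, 900, True)
    weak = OrfEvidence(0, False, False, 0, False)
    print(orf_quality(strong), orf_quality(weak))
```

A weighted sum is only one option; a rule-based or trained classifier would fit the same inputs.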

13 Visualizing Gene Predictions and Differences

14 ORFs of CPXVs from 4 different clades
ITR crm-D

15 45 Smallpox Strains
A. West African int. CFR ~10%
B. American alastrim minor CFR <1%
C. Asian major CFR ~5 - 35%
C-1. non-West-African African int. CFR ~10%
C-2. non-West-African African minor CFR <1%
1. Phylogenetic relationships of isolates into strains and variants agree with allopatric distribution, not CFR (major and minor)
2. A to C evolution

16 Unrooted tree: phylogenetic relationships of the ORF encoding the hemagglutinin protein
[Figure: unrooted tree with leaves labeled by GenBank accession number. Labeled groups: Vaccinia, Monkeypox, Variola, Camelpox, Taterapox, Ectromelia, Cowpox clade I, Cowpox clade II, Cowpox clade III (CPXV91_ger3), Cowpox clade IV (CPXV90_ger2)]

17 Next-Gen Diagnostic Sequencing Applications
Shotgun / Paired-End Sequencing: random shearing of DNA, even sequence coverage over the entire genome.
Ultra-deep or amplicon sequencing: ‘massively parallel’ sequencing not only produces throughput, it provides sequences of potentially millions of individual molecules (instant cloning). Sequencing a PCR reaction allows a detailed search for low expression quasi-species or mutations which may signal growing drug or vaccine resistance. Example: clinical case of poxvirus infection with samples exhibiting a reduced sensitivity to an antiviral drug.
Metagenomic sequencing: complex clinical, laboratory, or environmental samples can be sequenced to provide a diagnostic ‘snapshot’ of the resident organisms. Examples: tissue culture, soil.

18 Shotgun / Paired-End Sequencing
De novo Assembly: Newbler, CLCBio, Mira, Geneious, Velvet, Celera
Reference Mapping: Newbler, CLCBio, Mosaik, Mira, Geneious, BWA

19 Genome Assembly Visualization

20 Genome Assembly Visualization

21 Amplicon (deep) sequencing project
Li, Damon - NCEZID/DVRD/PRB
Clinical case of progressive vaccinia infection from smallpox vaccination of an immunocompromised patient
Pox antiviral ST-246 administered; it targets pox gene F13L, which encodes a major envelope protein that mediates production of extracellular virus
Oral ST-246 given daily and vaccination site sampled over a 3-week period

22 A region of gene F13L was amplified from clinical samples, deep sequenced, and compared to the smallpox vaccine reference sequence (Acambis 2000)
Control swab prior to ST-246

23 2 weeks after ST-246 T > A 943 C > T 869

24 3 weeks after ST-246 C > T 869 T > A 943

25 What is Metagenomics?
The genomic study of DNA from uncultured microorganisms, generally from environmental samples
Related: metatranscriptomics, metaproteomics

26 Sample Coverage: Rarefaction Curves
[Figure: rarefaction curves; x-axis: samples]
Source: Wooley JC, Godzik A, Friedberg I (2010) A Primer on Metagenomics. PLoS Comput Biol 6(2)
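
A rarefaction curve plots the number of distinct taxa observed against the number of reads (or samples) examined; a curve that flattens suggests that additional sequencing would find few new taxa. The sketch below computes one such curve from a list of per-read taxon labels by shuffling the reads once and counting distinct taxa in growing prefixes. The function name, the step size, and the toy labels are assumptions for illustration; published curves usually average over many random orderings.

```python
import random


def rarefaction_curve(read_taxa, step, seed=0):
    """Return (reads_examined, distinct_taxa_seen) points for one random ordering."""
    rng = random.Random(seed)
    shuffled = list(read_taxa)
    rng.shuffle(shuffled)

    seen = set()
    curve = []
    for count, taxon in enumerate(shuffled, start=1):
        seen.add(taxon)
        # Record a point every `step` reads, and always at the final read.
        if count % step == 0 or count == len(shuffled):
            curve.append((count, len(seen)))
    return curve


if __name__ == "__main__":
    # Toy data: three taxa with uneven abundance (labels are hypothetical).
    reads = ["taxon_A"] * 50 + ["taxon_B"] * 30 + ["taxon_C"] * 5
    for point in rarefaction_curve(reads, step=10):
        print(point)
```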

27 Classification Techniques
Supervised Taxonomic Classification
  Homology-based
    Database searching by similarity (BLAST, SW)
    BLAST, BLASTX: GenBank; specialized DBs: NCBI-ENV-NT, NCBI-ENV-NR
  Composition-based
    N-mer frequency
    Markov Models, Support Vector Machines (SVM); need training set
Unsupervised Taxonomic Classification
  Clustering methods
    SOM – self-organizing maps
    PCA – principal component analysis
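
Composition-based methods start from N-mer (k-mer) frequency vectors, which are then passed to a Markov model, an SVM, or a clustering method. The sketch below shows only the feature-extraction step; the function name is illustrative, and skipping windows that contain ambiguous bases is an assumption of this example rather than a requirement of any particular tool.

```python
from collections import Counter


def kmer_frequencies(sequence, k):
    """Relative frequency of each k-mer in a DNA sequence.

    Windows containing anything other than A, C, G, or T are skipped.
    """
    sequence = sequence.upper()
    valid = set("ACGT")
    counts = Counter()
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        if set(kmer) <= valid:
            counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


if __name__ == "__main__":
    print(kmer_frequencies("ACGTAC", 2))
```

Vectors like these, computed with the same k for every read or contig, can be compared directly or used as classifier input.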

28 Viral Metagenomic Pipeline (Wash U scripts implemented at CDC)
Sample Collection
DNA Library Construction
Sequencing
Basecalling
Vector Trimming
Assembly (output: contigs, reads)
Remove redundant sequences (output: unique sequences)
Mask repetitive and low complexity seqs (output: good sequences)
BLASTN against Human Genome (e ≤ 1e-10) (output: non-human sequences)
BLASTN vs nt, BLASTX vs nr, BLASTN vs GB-viral
Report Generation, Display in MEGAN, inspect top hits
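
The human-filtering step in this pipeline can be sketched as follows. This is not the Wash U code; it is a minimal illustration that assumes the BLASTN search against the human genome was written in the standard 12-column tabular format (query ID in column 1, e-value in column 11) and that any query with a hit at e ≤ 1e-10 is treated as human. The function names and the example read IDs are hypothetical.

```python
def human_read_ids(blast_tabular_lines, max_evalue=1e-10):
    """Collect query IDs that have at least one hit at or below max_evalue.

    Expects standard 12-column tabular BLAST output:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    human = set()
    for line in blast_tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        if float(fields[10]) <= max_evalue:
            human.add(fields[0])
    return human


def non_human_reads(all_read_ids, human_ids):
    """Keep reads with no significant human hit, preserving input order."""
    return [read_id for read_id in all_read_ids if read_id not in human_ids]


if __name__ == "__main__":
    hits = [
        "read1\tchr1\t99.0\t200\t2\t0\t1\t200\t5000\t5199\t1e-50\t350",
        "read2\tchr7\t85.0\t40\t6\t0\t1\t40\t900\t939\t0.002\t38",
    ]
    human = human_read_ids(hits)
    print(non_human_reads(["read1", "read2", "read3"], human))
```

Here read2 survives because its only human hit is weaker than the cutoff, and read3 survives because it has no human hit at all.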

29 Software for Taxonomic Classification
MEGAN – GUI for classification based on BLAST searches
CARMA – web-based classification using the Pfam database and HMMER alignment of protein families
MG-RAST – classification system utilizing protein-encoding databases and several ribosomal DBs; can analyze user-provided datasets; web use only
Geneious – commercial product
NextGENe – commercial product
Phymm, PhymmBL – composition-based classification system

30 Software for Comparative Metagenomics
MEGAN – can display two metagenome populations on the same phylogenetic tree; uses a BLAST file as input
STAMP – calculates statistical differences between sets of metagenomes
XIPE-TOTEC – performs pairwise comparisons of every metagenome in the two sets and creates a distance matrix, which is then used for clustering and PCA to calculate statistical values of relatedness

31 Megan


33 Ugandan Outbreak Samples
4 patients
Total RNA from patient sera
2 samples per 454 run
~565,000 reads/sample, avg length = 235 nt
Sequences were screened for random library amplification primers and low quality
Assembled each run de novo using the 454 gsAssembler
Performed a blastx database search using the assembled contigs (overnight)
Visualized the BLAST output using MEGAN

34 MEGAN (MetaGenomeANalyzer)

35 Ugandan Outbreak - results
Run 1 – 5 contigs (out of 2463 > 100 nt) matched YF virus, covering 98% of the genome (10,441 of 10,823 bp)
Mapped each sample from Run 1 using an Ethiopian YF virus as reference; individual reads from Sample 1 identified as YF
Run 2 – no YF reads found

36 Phylogenetic analysis of yellow fever virus sequences
Laura McMullan (DHPP/VSPB)

37 Comparative Metagenomics – current work
One 454 run, two samples
Sample 1 – ~578,000 reads, avg read length 438 bases
Sample 2 – ~550,000 reads, avg read length 425 bases
Total number of bases sequenced – ~488,000,000
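
The run total follows from reads multiplied by average read length, summed over the two samples. A quick check of that arithmetic, using the rounded figures from the slide (the dictionary layout and function name are illustrative):

```python
# (approximate read count, average read length in bases), as quoted on the slide
SAMPLES = {
    "Sample 1": (578_000, 438),
    "Sample 2": (550_000, 425),
}


def total_bases(samples):
    """Sum of reads x average read length over all samples."""
    return sum(reads * avg_length for reads, avg_length in samples.values())


if __name__ == "__main__":
    print(f"{total_bases(SAMPLES):,}")
```

With these rounded inputs the product is about 487 million bases, in line with the ~488,000,000 reported, since the read counts themselves are approximate.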

38 Sample 1 – Rarefaction Curve

39 Sample 1 Taxa tree (collapsed at the Order level)

40 Comparison of Sample 1 and 2

41 Bioinformatics Tools
Bioinformatics Packages: EMBOSS, BioInquiry
General Tools: Java/BioJava, Perl/BioPerl, BLAST Suite, BioEdit, GFFtoPS
Genome Comparison/Alignment Tools: Mavid, Mauve, Clustal, Muscle
Gene Prediction: Glimmer, GeneMark
Assembly/Mapping Tools: 454 Suite, Mosaik Tools, Mummer, CLC Bio, BWA, Velvet, AHA (PacBio)
Functional Annotation: Manatee
Phylogenetics: Paup, Phylip, MrBayes, Beauti/Beast, MEGA, DnaSP
Metagenomics: MEGAN, Galaxy, Carma
In-House: WAMS, POCs/VOCs

42 Challenges
Data Management – image files are large (1 run ~25 GB); moving these files around the network is slow
Assembly/Mapping Software – some is provided with the instrument, but additional methods and algorithms are needed
Finishing Tools – gap filling, primer design
Visualization Tools – tools to graphically display contigs on a reference sequence as well as genome multiple alignments
Generic Robust Annotation Tools – researchers need tools to intelligently choose predicted ORFs as genes, assign function, and submit to GenBank


44 What are the weaknesses of current next-gen sequencers?
Complicated and time-consuming library preparation
  Requires micrograms of DNA to begin
  3 days to prepare library
Requires amplification of library
  Low copy number polymorphisms may be missed
  Emulsion PCR is an inefficient, time-consuming, oily mess
  Potential to introduce PCR bias into sample
Instruments require repetitive sequential ‘flows’ of reagents
  Repetitive flows of nucleotides, blocking/unblocking chemistry, and washing out reaction byproducts all slow synthesis and hinder read length
  Consumes liters of reagents ($)
  Repetitive flows and imaging extend sequence runs to days (or weeks)

45 Pacific Biosciences SMRT sequencer (single-molecule sequencer)
Ion Torrent Personal Genome Machine (solid-state sequencer)
Nanopore sequencing

46 Pacific Biosciences SMRT sequencer
Sponsor: Influenza Research Agenda

47 Pacific Biosciences SMRT Technology
Individual ZMW with attached polymerase and DNA strand
Laser excitation/detection volume
[Figure labels: glass, ~50 nm]
Functional volume (red) is in zL!
SMRTcell = 160,000 ZMW
SMRTcell array = 1.5 million ZMW

48 Nucleotide incorporation is a realtime data movie
100 ms

49 Pacific Biosciences Advantages
Read lengths of 1,000 – 10,000 bases
No reagent ‘flows’ = 10-fold increase in sequencing speed
Substitute reverse transcriptase for polymerase and sequence RNA directly
Bacterial genomes sequenced in hours
Sequence run costs $99; takes 15 minutes to complete


51 454 Sequencing
1. DNA Library Prep
2. emPCR Amplification
3. Sequencing
4. Data Analysis

52 454 Sequencing: DNA Prep
Nebulization
  DNA is sheared with high-pressure nitrogen to create fragments ~ bases long
Repair Ends
  Double-stranded pieces are purified, blunt-ended, and phosphorylated
Adaptor Ligation
  Two different adaptors, A and B, are ligated to the fragment
  Each is 44 bases long: 20-base PCR primer, 20-base sequencing primer, 4-base key
  The B adaptor contains a biotin tag for immobilization
  This forms 4 different strands: A-A, A-B, B-A, B-B
Fragment Immobilization
  These are immobilized on streptavidin-coated magnetic beads; A-A strands will not bind and are washed away
Single-strand Isolation
  Bound fragments are denatured, and the released strands (containing both an A and a B tag) form a single-stranded template DNA library
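
The adaptor bookkeeping in the library-prep steps above can be sketched as follows. The adaptor length and the fate of A-A and mixed fragments come from the slide. The fate assigned to B-B fragments (both strands carry biotin, so neither is released on denaturation) is an assumption of this sketch, as are all function and constant names.

```python
from itertools import product

# Adaptor composition as stated on the slide (lengths in bases).
ADAPTOR_PARTS = {"PCR primer": 20, "sequencing primer": 20, "key": 4}

WASHED_AWAY = "washed away (no biotin, does not bind the bead)"
STAYS_BOUND = "stays bound (assumed: biotin on both strands)"
IN_LIBRARY = "released strand enters the single-stranded library"


def adaptor_length(parts=ADAPTOR_PARTS):
    """Total adaptor length from its component parts."""
    return sum(parts.values())


def fragment_fate(left_adaptor, right_adaptor):
    """Classify a ligated fragment by the adaptors on its two ends."""
    ends = {left_adaptor, right_adaptor}
    if "B" not in ends:
        return WASHED_AWAY
    if "A" not in ends:
        return STAYS_BOUND
    return IN_LIBRARY


if __name__ == "__main__":
    print("adaptor length:", adaptor_length())
    for left, right in product("AB", repeat=2):
        print(f"{left}-{right}: {fragment_fate(left, right)}")
```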

53 454 Sequencing: emulsionPCR
Emulsion-based clonal PCR
Annealing
  Fragments are annealed to primer-tagged “catcher” beads, optimized to anneal a single strand to a single bead
Distribution in a water-oil emulsion
  The captured DNA and beads, along with amplification reagents, are placed in a water-oil mixture
  Each bead is captured in a “bubble” that acts as its own small “micro-reactor”
  The emulsion is thermocycled, creating millions of copies of a single clonal fragment in individual “micro-reactors”
  The beads are then cleaned up and denatured

54 454 Sequencing: Sequencing by Synthesis
Bead Preparation – sequencing primer is attached, and polymerase and cofactors are added
Bead Deposition – beads are layered on a picotiter plate (wells are 44 μm), then enzyme beads and packing beads are added

55 454 Sequencing: Sequencing by Synthesis (cont.)
Enzyme beads contain sulfurylase and luciferase; packing beads help keep reaction beads in position
A fluidics system delivers sequencing reagents, flowing the nucleotides one at a time in a specific order across the wells

56 454 Sequencing: Sequencing by Synthesis (cont.)
If a nucleotide is incorporated, a pyrophosphate is released, which is converted to ATP by the sulfurylase
The ATP is used by the luciferase enzyme to oxidize luciferin, producing oxyluciferin and light
The light emission is measured with a CCD camera
Light intensity indicates nucleotide incorporation
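
The flow-by-flow logic described on these slides can be illustrated with an idealized flowgram: nucleotides are flowed one at a time in a fixed order, and the signal for each flow is proportional to how many bases were incorporated (zero if the flowed base does not match, more than one across a homopolymer). The sketch below assumes a noise-free signal and a T, A, C, G flow order; both are simplifying assumptions, and the function names are illustrative.

```python
def flowgram(read, flow_order="TACG"):
    """Idealized pyrosequencing signals for the strand being synthesized.

    Returns a list of (flowed_base, bases_incorporated) pairs, one per flow.
    """
    unknown = set(read) - set(flow_order)
    if unknown:
        raise ValueError(f"bases not in flow order: {sorted(unknown)}")

    signals = []
    position = 0
    while position < len(read):
        for base in flow_order:
            incorporated = 0
            # A homopolymer run is extended completely within a single flow.
            while position < len(read) and read[position] == base:
                incorporated += 1
                position += 1
            signals.append((base, incorporated))
    return signals


def call_bases(signals):
    """Turn flow signals back into a base sequence."""
    return "".join(base * count for base, count in signals)


if __name__ == "__main__":
    flows = flowgram("TTAG")
    print(flows)
    print(call_bases(flows))
```

In real data the intensities are noisy, which is why long homopolymers are the characteristic error source for this chemistry.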

57 454 Sequencing: Sequencing by Synthesis (cont.)
Characteristics
  Flow of the four nucleotides is repeated for one hundred cycles, resulting in average read length of bases
  System averages ~1,000,000 high-quality wells
  Therefore, a typical run yields over 400 million high-quality bases

58 454 Sequencing: Paired End Protocol
