3 Evolution of genomic technologies
Genetic mapping studies: discovery of genes for well-characterized Mendelian diseases.
Dense SNP genotyping using microarray technology: GWAS for discovery of common variants in common disease.
High-throughput sequencing: discovery of rare variants in Mendelian diseases not previously recognized.
Genomic technologies continue to change at a rapid pace, especially over the past 20 years. These technological advances can be grouped into three eras.
The development of complete genetic maps of the human genome in the 1980s fueled the mapping of Mendelian loci in extended kindreds for dominant traits and predominantly in consanguineous kindreds for recessive traits. Further accelerated by the acquisition of the sequence of the human genome in 2001, this first Mendelian era identified over 2,800 disease loci and profoundly changed our understanding of the biology and pathophysiology of every organ system. It was a labor-intensive and slow process.
A second era was defined by the development of microarray technology and the identification of more than 10 million common variants in the human genome. Microarrays were developed to genotype 500K to 5 million SNPs in order to identify common variants associated with human disorders. This era led to the identification of more than 1,000 loci that show robust association with human disease and that have changed the understanding of disease biology.
We have recently entered a third era of discovery, this one driven by spectacular reductions in the cost of DNA sequencing, from ~$100,000 per million bases in 1998 to ~$0.10 today on the HiSeq instrument. Coupled with our development of robust methods for selectively sequencing the complete coding regions of the genome, which harbor the overwhelming majority of Mendelian loci, and analytic methods that identify variations from the reference sequence rapidly and with high sensitivity and specificity, one can now sequence ostensibly all the genes in the human genome (the exome) to high levels of completion for ~$1000 (direct cost). This has provided fundamental new opportunities for identifying Mendelian loci that were previously elusive.
Restriction fragment length polymorphisms, breeding experiments, linkage studies using satellite markers, etc.
4 Why High-Throughput DNA Sequencing
DNA sequencing can provide a deeper understanding of DNA/RNA than any other technology.
Microarray technology revolutionized biomedical research but has several limitations, which DNA sequencing may overcome.
As the cost of sequencing is rapidly decreasing, it is becoming affordable to perform sequencing at the genome level.
Why do we need a high-throughput DNA sequencing center at Yale? There is no doubt that microarray technology has revolutionized biomedical research, but it has several limitations. It relies on indirect observation based on hybridization signals, which can be non-specific because of cross-hybridization, and it is not sensitive enough to identify low levels of change. Microarrays can also provide information only about what is represented on the chip. DNA sequencing may be able to overcome some of these limitations.
Figure: Number of PubMed articles. In recent years there has been an explosion of research articles using next-generation sequencing technologies.
5 Applications of High-throughput DNA Sequence Analyses
DNA sequencing applications:
Re-sequencing: mutation/SNP discovery and profiling
Interactome: DNA-protein interactions, ChIP-Seq
Transcriptome analysis: alternative splicing and allele-specific expression
microRNA: expression and discovery
Clinical diagnosis
Epigenomics: DNA methylation
De novo sequencing
Population metagenomics
Copy number variation
6 First Generation: Sanger sequencing
1980 Nobel Prize in Chemistry; phi X 174, ~5300 bp; gels read by hand; radiolabeled dideoxyNTPs; one lane per nucleotide; 800 bp reads; low throughput (several kb/gel)
7 Second-generation sequencing: massively parallel sequencing of millions of templates
Illumina; Ion Torrent-Proton
8 Second Generation: Massively Parallel Sequencing
Throughput (24 hours): ... Mb (Sanger); 60,000-1,200,000 Mb (HiSeq X Ten)
Cost: $1500/Mb (Sanger); $0.04/Mb (HiSeq 2500 and X)
Read lengths: ~800 bp (Sanger); ~100-600 bp (HiSeq)
Error rates: < 0.5% (Sanger); ~0.8% (HiSeq)
9 Illumina next-generation sequencing platforms
MiSeq; HiSeq 2000; HiSeq 2500; NextSeq 500; HiSeq X Ten
10 Comparison of MiSeq, NextSeq, HiSeq 2500 and HiSeq X Ten Sequencing
Positioning: MiSeq, focused power; NextSeq, flexible power; HiSeq 2500, production power; HiSeq X Ten, population-scale whole human genome sequencing at $1000/genome.
Columns: MiSeq | NextSeq Mid Output | NextSeq High Output | HiSeq 2500 Rapid Run | HiSeq 2500 High Output | HiSeq X Ten
Output/run (Gb): 3 to 15 | 20-40 | 30-120 | 20-360 | 100-2,000 | 3,200-3,600
Reads/run: 25 M | 130 M | 400 M | 600 M | 4,000 M | 6,000 M
Run times: 5-65 hrs. | 15-26 hrs. | 12-30 hrs. | 7-40 hrs. | 1-6 days | 3 days
Gb/day: 6 | 37 | 96 | 215 | 330 | 1,200
Flow cells: 1; 1 or 2
CapEx: MiSeq $100,000; NextSeq $250,000; HiSeq 2500 $740,000; HiSeq X Ten $10 million (sold only in a pack of 10)
HiSeq X Ten: 10 instruments, most cost effective when operated at full capacity of 18,000 WGS/year.
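The Gb/day figures on this slide appear to follow from dividing the maximum output per run by the longest run time. That reading is an assumption, not something the slide states; the short Python sketch below simply redoes the arithmetic under that assumption. The function name `gb_per_day` is illustrative.

```python
def gb_per_day(output_gb: float, run_hours: float) -> float:
    """Average daily yield, assuming one run of `run_hours` producing `output_gb`."""
    if run_hours <= 0:
        raise ValueError("run_hours must be positive")
    return output_gb * 24 / run_hours


# Values taken from the slide: maximum output (Gb) and longest run time (hours).
platforms = {
    "NextSeq High Output": (120, 30),
    "HiSeq X Ten": (3600, 3 * 24),
}

for name, (output_gb, run_hours) in platforms.items():
    print(f"{name}: {gb_per_day(output_gb, run_hours):.0f} Gb/day")
```

Under this assumption the NextSeq High Output and HiSeq X Ten columns reproduce the slide's 96 and 1,200 Gb/day.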
11 Overall Illumina Sequencing Workflow
Sample preparation: sequencing library preparation (construct: Adapter 1, Adapter 2, sequencing primer, insert).
Cluster generation: hybridizing the library to the flow cell; creating clusters from individual molecules.
Sequencing by synthesis: add all 4 bases with reversible terminators; image 4 colors; remove terminator; repeat.
Notes: Introductory workflow. It is good to start with the basics and go on from here. Explain that these 3 steps are 3 separate kits that one purchases. Users can work with their salesperson to determine which kits, and in what amounts, they want to purchase. Emphasize that for any of our products (genomic, expression, ChIP, etc.) you follow these 3 basic steps: sample prep (library prep), cluster generation on a flow cell, and sequencing on the Genome Analyzer.
Sample prep: the processes used in the kits end up with the construct illustrated, for all sequencing types: 2 different adapters, a sequencing primer, and an insert. If the group will do paired end, you can mention that the adapters will be slightly different, with different sequencing primers on both ends of the insert (this will be confusing for a new group; come back to this slide later if someone asks).
Cluster generation: show them the flow cell picture, with 8 lanes for 8 different samples. The library hybridizes to the flow cell, with individual molecules forming the clusters that will be sequenced. The different molecules of the library are physically separated from one another so the sequence of each one can be determined.
Sequencing by synthesis: describe the general process with the reversible terminators. You can introduce the concept that the GA has a "chemistry cycle", in which the last block is removed and the next base is added, followed by an imaging cycle.
12 Genomic Sample Prep Workflow
Starting material: purified genomic DNA
1. Genomic DNA fragmentation: fragments of less than 800 bp
2. End-repair: blunt-ended fragments with 5'-phosphorylated ends
3. Klenow exo- with dATP: 3'-dA overhang
4. Adapter ligation: adapter-modified ends
5. Gel purification/bead: removal of unligated adapter
6. PCR: genomic DNA library (Adapter 1, Adapter 2, sequencing primer, insert)
Notes: We are using the Genomic Sample Prep Workflow as an example of the basic sample prep protocol; each protocol is different. All sample prep methods come with their own protocol, which follows standard molecular biology cloning techniques.
13 What is a Flow Cell?
A flow cell is a thick glass slide with 8 channels or lanes.
Each lane is randomly coated with a lawn of oligos (P5 oligo, P7 oligo) that are complementary to the library adapters (construct: Adapter 1, Adapter 2, insert, sequencing primer).
Ordered flow cells: highest number of clusters.
14 Cluster Generation: Template hybridization and Initial Extension
Steps on the grafted flow cell: template hybridization; initial extension (3' extension); denaturation; original template is washed away.
Single molecules are bound to the flow cell in a random pattern; >200 million single molecules hybridize to the lawn of primers.
Notes: The initial steps for the PE chemistry are the same as for the Single Read chemistry.
15 Cluster Generation: Amplification
The single strand flips over to hybridize to adjacent oligos and form a bridge; the hybridized primer is extended by polymerases; the double-stranded bridge is denatured. Result: two copies of covalently bound single-stranded templates.
Cycles: 1st cycle annealing, extension, denaturation; 2nd cycle annealing, extension, denaturation; n = 35 total.
Notes: The amplification steps are also the same, except that 28 cycles are recommended for any sample where the insert is greater than 200 bp. More cycles for samples with inserts greater than 200 bp will cause the clusters to get too large after P5 resynthesis (see slide 6).
16 Cluster Generation: Linearization, Blocking and sequencing primer hybridization
Steps: cluster amplification; P5 linearization; block with ddNTPs; denaturation and sequencing primer hybridization.
dsDNA bridges are denatured; complement strands are cleaved and washed away; free 3' ends are blocked to prevent unwanted DNA priming.
Notes: The first linearization step uses the Linearization 1 Enzyme instead of periodate. The blocking step still uses ddNTPs but uses Blocking Enzyme 1 and 2 in the PE protocol instead of the terminal transferase used in the Single Read protocol. Read 1 primer hybridization uses the Read 1 PE Sequencing Primer. Enzymatic cleavage; uracil incorporation enzyme.
17 Sequencing
Steps: sequencing first read; denaturation and de-protection; resynthesis of P5 strand (15 cycles); P7 linearization; block with ddNTPs; denaturation and hybridization; sequencing second read.
Notes: The steps up to and including first-read sequencing are much the same as for a single read. First-read sequencing is where the single read protocol would stop. The PE protocol continues by deprotecting the P5 primer using the deprotection enzyme. Resynthesis of the P5 strand occurs over 15 cycles. P7 linearization uses the Linearization 2 Enzyme. Blocking again uses ddNTPs and Blocking Enzyme 1 and 2. Sequencing of read 2 uses the Read 2 PE Sequencing Primer. 5' to 3'.
18 Reversible Terminator Seq Chemistry
All 4 labeled nucleotides in 1 reaction (green, orange, red and blue).
Advantages of reversible terminators: only one base is added at a time; the fluor can be cleaved off after imaging, so it does not emit color at the next cycle and only the newly added base (with its attached fluor) emits light.
Cycle: incorporation; detection; deblock and fluor removal; next cycle.
Figure: nucleotide structure showing the free 3' end, cleavage site, fluor and 3' block.
19 Sequencing By Synthesis (SBS)
Cycle 1: add sequencing reagents; first base incorporated; remove unincorporated bases; detect signal (imaging); cleave off fluor and deblock.
Cycle 2-n: add sequencing reagents and repeat.
All four labeled nucleotides in one reaction; high accuracy; base-by-base sequencing; no problems with homopolymer repeats.
20 Representation of Base Calling From Raw Data
Figure: images from cycles 1 to 9 for two clusters, read as T G C T A C G A T … and T T T T T T T G T …
The identity of each base of a cluster is read off from sequential images.
Notes: Colorized marketing slide to represent what is going on here. Point out that each photo here represents the 4 images that you would usually get, one for each base. The software determines the "winner" for intensity and calls the base.
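The "winner for intensity" idea in the notes can be sketched in a few lines of Python. This is a minimal illustration only, not the actual ICS/RTA base caller: the channel order `ACGT`, the function name `call_bases` and the intensity values are all assumptions made for the example, and real software also corrects for cross-talk and phasing.

```python
from typing import Sequence

BASES = "ACGT"  # assumed channel order; the slide does not specify one


def call_bases(intensities: Sequence[Sequence[float]]) -> str:
    """Call one base per cycle for a single cluster.

    intensities[c] holds the four channel intensities (A, C, G, T) for cycle c.
    The brightest channel in each cycle is taken as the called base.
    """
    read = []
    for cycle in intensities:
        if len(cycle) != len(BASES):
            raise ValueError("each cycle needs exactly four channel intensities")
        winner = max(range(len(BASES)), key=lambda i: cycle[i])
        read.append(BASES[winner])
    return "".join(read)


# Hypothetical intensities for the first three cycles of one cluster.
example = [
    [0.1, 0.2, 0.1, 0.9],  # T channel brightest
    [0.2, 0.1, 0.8, 0.1],  # G channel brightest
    [0.1, 0.7, 0.2, 0.1],  # C channel brightest
]
print(call_bases(example))  # prints "TGC"
```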
21 Primary and Secondary Analysis Overview
Columns: Analysis type | Software | Outputs
Sequencing | ICS/RTA | Images/TIFF files
Primary analysis | ICS/RTA | Base calling, intensities
Secondary analysis | Consensus Assessment of Sequence And VAriation (CASAVA): Bulldog N | Alignments and variant detection
22 Ion Torrent PGM and Proton
Ion PGM™ Sequencer. 4 Ion Protons: coming soon.
First PostLight sequencing technology: instead of using light as an intermediary, PGM creates a direct connection between the chemical and the digital worlds.
23 The Chip is the Machine
Uses semiconductor chips for sequencing.
Columns: Ion 314 Chip v.2 | Ion 316 Chip v.2 | Ion 318 Chip v.2
Wells: 1.3 million | 6.3 million | 11 million
Output, 200 base: 30-50 Mb | ... Mb | 600 Mb-1 Gb
Output, 400 base: 60-100 Mb | 600 Mb-1 Gb | 1.2-2 Gb
Ion PI chip: >165 million wells per chip; 8 to 10 Gb of data per run.
Ion PII chips: ~100 Gb of data in ~4 hours.
24 Base Calling
When a nucleotide is incorporated into a strand of DNA, a hydrogen ion is released as a by-product. The hydrogen ion carries a charge, which the PGM's ion sensor detects and records as the incorporation of a base.
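A rough way to picture this kind of base calling is as decoding a series of nucleotide flows, where the size of each ion signal reflects how many bases were incorporated during that flow. The Python sketch below is an illustration under stated assumptions, not the vendor's algorithm: the flow order `TACG`, the normalization of one incorporation to a signal of about 1.0, and the example signal values are all hypothetical.

```python
from typing import Sequence


def decode_flows(signals: Sequence[float], flow_order: str = "TACG") -> str:
    """Turn per-flow ion signals into a base sequence.

    Each flow offers one nucleotide species (cycling through flow_order).
    The signal is assumed to be normalized so that one incorporated base
    gives about 1.0; rounding estimates the homopolymer length.
    """
    read = []
    for i, signal in enumerate(signals):
        base = flow_order[i % len(flow_order)]
        count = max(0, round(signal))  # negative noise is treated as zero
        read.append(base * count)
    return "".join(read)


# Hypothetical normalized signals for six flows: T, A, C, G, T, A.
signals = [1.04, 0.02, 0.97, 2.10, 0.05, 1.02]
print(decode_flows(signals))  # prints "TCGGA"
```

Because run length is inferred from signal size, a noisy reading between, say, 4 and 5 incorporations is easy to miscall, which is one way to understand why homopolymers are harder for this chemistry.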
25 Advantages and Current Limitations
Advantages: low equipment cost; rapid run times (3 to 4 hours); simple chemistry.
Limitations: homopolymer detection; error rates; slow introduction of newer chips (overpromise); PGM and Proton are two separate pieces of sequencing equipment; tedious library prep protocols.
27 The Third Generation Sequencing Platform: PacBio RS
Pacific Biosciences has developed Single Molecule Real Time (SMRT™) DNA sequencing technology: PacBio RS.
This technology enables, for the first time, the observation of natural DNA synthesis by a DNA polymerase as it occurs.
This technology delivers long reads at the single-molecule level and a fast time to result, enabling a new paradigm in genomic analysis.
Notes: Most people here are familiar with Sanger sequencing, the so-called first-generation sequencing, and with second-generation sequencing technology such as the Illumina HiSeq system. The latter starts with library prep, with PCR amplification and cluster building. After sequencing, it generates tens of millions of short reads. Today I am going to introduce the third-generation sequencing platform PacBio RS, developed by Pacific Biosciences, which can do single-molecule real-time sequencing. The technology is called SMRT, for Single Molecule Real Time. It enables, for the first time, the observation of natural DNA synthesis by a DNA polymerase as it occurs. The major advantages of this new sequencing technology are that it delivers long reads at the single-molecule level and a fast time to results.
29 Key Applications for PacBio RS
Targeted sequencing: SNP and structural variant detection; repetitive regions; full-length transcript profiling.
De novo assembly and genome finishing: bacterial genomes; fungal genomes; gap-captured sequencing; targeted captured sequencing.
Base modification detection: methylation; DNA damage.
Notes: First I want to show you a paper published in Nature last week that used PacBio sequencing to identify mutations in the kinase FLT3, which is associated with AML. ... % of AML patients have this ITD mutation. It is an activating mutation, and there are drugs that can effectively inhibit the kinase. The problem is that resistance to the drug develops after a certain time, and the drug resistance is likely caused by a few mutations in the kinase domain. So, to find out whether the ITD mutation and the drug-resistance mutations are really the disease-causing mutations, they had to determine whether any resistance mutations found were on the same strand as the FLT3-ITD.
Projects at YCGA. YCGA PacBio RS.
30 Comparisons Between PacBio RS and Illumina HiSeq
Columns: PacBio RS (third generation) | Illumina HiSeq (second generation)
Sequencing chemistry: Single Molecule Real Time (SMRT) | Sequencing by synthesis (SBS)
Sequencing substrate: SMRT Cell made up of 150,000 ZMWs | Flow cell made up of 8 separate lanes
Data output per day: 1 to 2 billion bases/day | 60-1,200 billion bases/day
Cost/Mb: $1.5/Mb | $0.04/Mb
Read length: average up to 5 kb | 50 bp to 150 bp
Error rates: Raw: ... %. With 30x coverage: Q50 (< 0.01) | 0.5 to 1%
Sample library: SMRTbell template (single-strand circular DNA), 250 bp to 10 kb insert | dsDNA with adaptors (175 bp to 1 kb)
Notes: As shown in this table, both technologies use sequencing-by-synthesis chemistry. The difference for PacBio is that it performs the sequencing at the single-molecule level and in real time. The sequencing consumables are also different. On the HiSeq, sequencing is carried out on a flow cell that has 8 lanes, each with millions of DNA clusters. On the PacBio, sequencing is carried out on a SMRT Cell, which comprises 150,000 microscopic holes called ZMWs, which stands for zero-mode waveguides. Each ZMW can hold one DNA molecule with primer and polymerase. For base calling, Illumina uses the images taken during the sequencing run, whereas PacBio uses the movies collected in real time while DNA synthesis is happening.
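The table mixes two ways of stating accuracy: a Phred-style quality value (Q50) and percentage error rates. The two are related by the standard Phred definition, Q = -10 · log10(p), where p is the error probability. The Python sketch below just applies that definition; the function names are illustrative.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Error probability p for a Phred quality Q, using p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Phred quality Q for an error probability p, using Q = -10 * log10(p)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -10 * math.log10(p)


# Q50 from the table, and the 0.5 to 1% range quoted for the HiSeq.
print(phred_to_error_probability(50))
print(error_probability_to_phred(0.01))
print(error_probability_to_phred(0.005))
```

On this scale Q50 corresponds to an error probability of about 1 in 100,000, while a 0.5 to 1% per-base error rate corresponds to roughly Q23 to Q20.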
34 Performance/Limitations…..?
Advantages: nanopores offer a label-free, electrical, single-molecule DNA sequencing method; no costly fluorescent labeling reagents; no need for expensive optical hardware and sophisticated instrumentation to detect DNA, RNA and protein.
Performance/Limitations…..? First data was released in Feb ...; no updates since then. No data available for evaluation. High error rates: >5%. An early access program will start in the next few months.
36 YCGA
YCGA was established in January 2009 through generous funding support and strong commitment from Yale University and the School of Medicine.
Located in a newly renovated building. Approximately 7,000 sq ft of laboratory and ~4,000 sq ft of office space. 20 staff.
Figure: portion of the laboratory showing sequencing systems through the glass wall partition that separates the laboratory from the office and administrative area.
37 Sequencing Platforms at YCGA
10 Illumina HiSeqs; one MiSeq; one PacBio RS; Ion PGM™ Sequencer. Will acquire the new Illumina sequencers introduced just a few weeks ago.
YCGA has kept pace with cutting-edge sequencing technologies.
Notes: YCGA is well equipped with cutting-edge technologies. Because the technology keeps improving at a very fast pace, it has been a challenge to keep up with it. New technologies are expensive, and sometimes we have to change platforms before we have recovered the investment. Despite numerous challenges, YCGA has been very successful in keeping up with the change while maintaining data production and balancing the operating budget.
38 Computer Infrastructure
BulldogN: provides ~1300 cores and 2.2 PB of high-performance storage.
Dave Frioni and their team from Yale ITS. Robert Bjornson, Ph.D., IT director for YCGA. Nicholas Carriero, Ph.D.
39 Increasing Demand for Sequencing at YCGA
Increase in the number of Principal Investigators using YCGA over the past 4 years. Trends in sequence data output at YCGA (6-month averages).
40 Types of samples processed and runs of sequence read lengths carried out at YCGA in a typical month
41 Whole-Genome vs. Whole-Exome Sequencing
Protein-coding genes (the exome) constitute 1% of the human genome but harbor 85% of disease-causing mutations.
Significantly cheaper than sequencing the entire genome; data storage challenges; validation challenges.
Maq, >25,000 exomes analyzed for several disorders, including cardiovascular, abnormal brain development, autism, liver, kidney, hypertension, skin and various tumors.
2.1M probes cover ~300,000 exons of 19,000 genes. Total covered bases: 44.1 Mb.
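The 44.1 Mb target size makes it easy to estimate how much sequencing an exome needs. The Python sketch below uses the usual mean-depth relation (depth = reads × read length × on-target fraction / target size). Only the 44.1 Mb figure comes from the slide; the read length, on-target fraction and desired depth are hypothetical inputs chosen for illustration.

```python
import math

TARGET_BASES = 44_100_000  # total covered bases from the slide (44.1 Mb)


def mean_depth(n_reads: int, read_length: int, on_target_fraction: float,
               target_bases: int = TARGET_BASES) -> float:
    """Mean depth over the target, counting only on-target bases."""
    return n_reads * read_length * on_target_fraction / target_bases


def reads_needed(depth: float, read_length: int, on_target_fraction: float,
                 target_bases: int = TARGET_BASES) -> int:
    """Reads required to reach a mean depth over the target."""
    if read_length <= 0 or not 0 < on_target_fraction <= 1:
        raise ValueError("read_length must be > 0 and on_target_fraction in (0, 1]")
    return math.ceil(depth * target_bases / (read_length * on_target_fraction))


# Hypothetical run: 100 bp reads, half of the sequenced bases on target, 50x mean depth.
print(reads_needed(depth=50, read_length=100, on_target_fraction=0.5))
```

Mean depth is only a first approximation; capture efficiency varies between exons, so real projects also look at the fraction of target bases covered above a minimum depth.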
42 Need for strong R&D efforts for Next-Generation sequencing operation
Optimization of sample preparation protocols for exome capture, which has decreased the cost of a single human exome from $8,000 in 2009 to the current price of ~$500 while improving the quality of the data.
Development of a highly efficient protocol to extract and repair DNA from formalin-fixed paraffin-embedded blocks for genetic analysis.
Improved protocols for gDNA-seq, RNA-seq and ChIP-seq that show higher data complexity than traditional protocols, allow users to start with less material, and cost less.
Notes: The first point is extremely significant because >90% of our sequencing is human exomes. The improvements we have made have increased our data quality, decreased our costs, and allowed us to dramatically increase our throughput.
There are likely billions of formalin-fixed paraffin-embedded (FFPE) samples around the world. The fixation/storage process damages the DNA, and it was thought that these samples would be unusable for genetic analysis. Our protocol allows us to use these samples for exome analysis and makes possible many new and interesting experiments that could not otherwise be performed.
By spending the time and money to improve all of our protocols, not just human exomes, we are able to offer the Yale community a variety of sample preparation options that produce the most complex data possible at some of the lowest costs in the country.
43 Scientific and economic impact of high throughput sequencing at Yale
44 List of select publications resulting from next-generation sequencing usage at YCGA
Whole-exome sequencing identifies recessive WDR62 mutations in severe brain malformations. Bilguvar and Gunel. Nature, v467, 2010
A Novel miRNA Processing Pathway Independent of Dicer Requires Argonaute2 Catalytic Activity. Cifuentes and Giraldez. Science, v328, 2010
Mitotic recombination in patients with ichthyosis causes reversion of dominant mutations in KRT10. Choate and Lifton. Science, v330, 2010
Transcriptomic analysis of avian digits reveals conserved and derived digit identities in birds. Wang and Wagner. Nature, v477, 2011
Transposon-mediated rewiring of gene regulatory networks contributed to the evolution of pregnancy in mammals. Lynch and Wagner. Nat Genet., v43, 2011
K+ channel mutations in adrenal aldosterone-producing adenomas and hereditary hypertension. Choi and Lifton. Science, v331, 2011
Recessive LAMC3 mutations cause malformations of occipital cortical development. Barak and Gunel. Nat Genet., v43, 2011
Spatio-temporal transcriptome of the human brain. Kang and Sestan. Nature, v478, 2011
Langerhans cells facilitate epithelial DNA damage and squamous cell carcinoma. Modi and Girardi. Science, v335, 2012
Mutations in kelch-like 3 and cullin 3 cause hypertension and electrolyte abnormalities. Boyden and Lifton. Nature, v482, 2012
De novo point mutations, revealed by whole-exome sequencing, are strongly associated with Autism Spectrum Disorders. Sanders and State. Nature, v485, 2012
Exome sequencing identifies recurrent somatic RAC1 mutations in melanoma. Krauthammer and Halaban. Nat Genet., v44, 2012
Genomic Analysis of Non-NF2 Meningiomas Reveals Mutations in TRAF7, KLF4, AKT1, and SMO. Clark and Gunel. Science, v339, 2013
De novo mutations in histone-modifying genes in congenital heart disease. Zaidi and Lifton. Nature, v498, 2013
Recessive mutations in DGKE cause atypical hemolytic-uremic syndrome. Lemaire and Lifton. Nat Genet., v45, 2013
Somatic and germline CACNA1D calcium channel mutations in aldosterone-producing adenomas and primary aldosteronism. Scholl and Lifton.
The evolution of lineage-specific regulatory activities in the human embryonic limb. Cotney and Noonan. Cell, v154, 2013
Mutations in DSTYK and dominant urinary tract malformations. Sanna-Cherchi and Gharavi. N Engl J Med., v369, 2013
Nanog, Pou5f1 and SoxB1 activate zygotic gene expression during the maternal-to-zygotic transition. Lee and Giraldez. Nature, 2013
Co-expression networks implicate human mid-fetal deep cortical projection neurons in the pathogenesis of autism. Willsey and State. Cell, 2013 (In press)
45 Impact of High Throughput Sequencing: Partial Grant Funding
Mendelian center grant, NIH: $12M (3y)
Gilead cancer grant: $40M (4y)
Brain tumor gift: $12M (4y)
ARRA brain development (NIH): $3M (2y)
ARRA kidney disease (NIH): $2M (2y)
Simons autism sequencing: $4M (3y)
Brain transcriptome (NIH): $10M (2y)
Congenital heart disease (NIH): $3M (4y)
Melanoma SPORE: $12M (5y)
Fidelity (computer storage): $0.55M
VA, Schizophrenia/Bipolar disorder: $12.3M
Yale Comprehensive Cancer Center: $14.0M
Total: $ M
46 The Centers for Mendelian Genomics Supported by NHGRI and NHLBI in Dec 2011
47 CMGs: Goals
Discover the genes and variants responsible for as many Mendelian phenotypes as possible.
Develop and disseminate improved methods for disease gene discovery and analysis.
Create public resources to enhance research and discovery activities.
Educate colleagues and the public regarding Mendelian disease.
Whole-exome/whole-genome analysis is carried out at no cost and on a collaborative basis. Investigators with interesting patients or cohorts can contact us.
48 Opportunities: DNA Sequencing and Personalized Medicine
Use of genomics, the science of looking at all of the information in the human genome, to tailor medical care to individuals based on their genetic makeup.
Earlier interventions; improved diagnosis; more effective drug development; better medical outcomes.
DNA sequencing has a very bright future and will change the current practice of medicine.
49 #1 invention of the year 2008, according to Time magazine
50 CLIA: The New Paradigm in Molecular Diagnostics
Conventional molecular testing: gene by gene.
Genomic testing using exome analysis.
YCGA is carrying out clinical diagnostic work in collaboration with Dr. Allen Bale. Over 500 exomes have been analyzed for various disorders.
51 Major Challenges
Cost associated with being cutting-edge:
A) Equipment: rapid introduction of new sequencing technologies (investment challenge); challenges associated with new technologies (upgrades and breakdowns).
B) Reagents: constant change and introduction of new, more reliable reagents (v3).
C) Software upgrades, computer infrastructure challenges, analysis.
E) Constant upgrade of protocols: sequence capture (6 versions and still being updated); RNA-Seq (TruSeq, cheaper than the older version).
52 Sequencing a genome is simple; finding the cause of a disease is not.
53 Despite challenges, tremendous progress has been made at a rapid pace. NGS will continue to make a huge impact in biology, biomedicine and human health.
54 Thank You!
Jim Noonan; Yale University and Medical School and West Campus administration; ITS, HPC and Bioinformatics staff; YCGA staff; collaborating Yale investigators