Utility of Next Generation Sequencing/genetics in medicine
Goals of biomedical investigation:
- Understand normal, healthy and disease biology
- Enable prevention and early diagnosis of disease
- Enable new effective treatments
Unbiased approach to identify new pathways underlying basic physiology, health and disease
Evolution of genomic technologies
- Genetic mapping studies: discovery of genes for well-characterized Mendelian diseases.
- Dense SNP genotyping using microarray technology: GWAS for discovery of common variants in common disease.
- High-throughput sequencing: discovery of rare variants in previously unrecognized Mendelian diseases and in common diseases.
- DNA sequencing can provide a deeper understanding of DNA/RNA than any other technology.
- Microarray technology revolutionized biomedical research, but it has several limitations that DNA sequencing may overcome.
- As the cost of sequencing rapidly decreases, it is becoming affordable to sequence at the genome level.
[Figure: number of PubMed articles per year] In recent years there has been an explosion of research articles using next generation sequencing technologies.
First Generation: Sanger sequencing (1975-1977)
- 1980 Nobel Prize in Chemistry
- phi X 174, ~5300 bp
- Radiolabeled dideoxyNTPs, one lane per nucleotide
- Gels read by hand
- 800 bp reads
- Low throughput (several kb/gel)
Second-generation sequencing
Massively parallel sequencing of millions of templates
- 454/Roche
- Illumina
- Ion Torrent-Proton
HiSeq 2500 Sequencing System
Fast turnaround and highest output in a single instrument
1 instrument, 2 run modes (user configurable)
High Output Mode (highest output):
- 600 Gb in ~10.5 days
- 6 human genomes in 10.5 days
- Current v3 flow cell
- Current v3 reagents
- cBot required
Rapid Run Mode (fastest turnaround):
- 120 Gb in ~1 day
- 1 human genome in a day
- New 2-lane flow cell
- New reagents
- No cBot required
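The two run modes can be compared by the average depth each gives per genome. The sketch below is a rough back-of-the-envelope calculation: the haploid human genome size of 3.1 Gb is an assumption (it is not stated on the slide), and the function name `mean_coverage` is made up for illustration. It ignores duplicates, quality filtering and unaligned reads, so real usable coverage would be lower.

```python
# Assumed haploid human genome size in Gb (not given on the slide).
HUMAN_GENOME_GB = 3.1


def mean_coverage(output_gb: float, genomes: int,
                  genome_gb: float = HUMAN_GENOME_GB) -> float:
    """Average fold-coverage per genome if a run's output is split evenly."""
    return output_gb / genomes / genome_gb


# Figures taken from the slide: 600 Gb for 6 genomes, 120 Gb for 1 genome.
high_output = mean_coverage(600, 6)
rapid_run = mean_coverage(120, 1)

print(f"High Output Mode: ~{high_output:.0f}x per genome")
print(f"Rapid Run Mode:   ~{rapid_run:.0f}x per genome")
```

Under these assumptions both modes land in the 30x to 40x range per genome, which is why the slide can quote output in genomes rather than gigabases.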
New sequencing platforms by Illumina
- HiSeq X Ten and HiSeq X Five: production-scale human whole genome sequencing; 18,000 genomes/year at a cost of $1,500/genome
- HiSeq 3000/HiSeq 4000: up to 1.5 Tb/run; whole genome as well as other applications, including exome sequencing
Overall Illumina Sequencing Workflow
1. Sample Preparation
2. Sequencing Library Preparation
   [Diagram labels: Adapter1, Adapter2, Sequencing Primer, Insert]
3. Cluster Generation
   - Hybridizing library to flow cell
   - Creating clusters from individual molecules
4. Sequencing by Synthesis
   - Add all 4 bases with reversible terminators
   - Image 4 colors
   - Remove terminator, repeat
Genomic Sample Prep Workflow
Starting material: purified genomic DNA
1. Genomic DNA fragmentation: fragments of less than 800 bp
2. End-repair: blunt-ended fragments with 5'-phosphorylated ends
3. Klenow exo- with dATP: 3'-dA overhang
4. Adapter ligation: adapter-modified ends
5. Gel purification/bead: removal of unligated adapter
6. PCR: genomic DNA library
[Diagram labels: Adapter1, Adapter2, Sequencing Primer, Insert]
What is a Flow Cell?
- A flow cell is a thick glass slide with 8 channels, or lanes.
- Each lane is randomly coated with a lawn of oligos that are complementary to the library adapters.
[Diagram labels: Adapter1, Adapter2, Insert, Sequencing Primer, P5 oligo, P7 oligo, Index]
Reversible Terminator Seq Chemistry
[Figure: labeled nucleotide with a cleavable fluor and a 3' block; cycle of incorporation, detection, then deblock and fluor removal, leaving a free 3' end for the next cycle]
All 4 labeled nucleotides in 1 reaction (green, orange, red and blue)
Advantages of reversible terminators:
- Only one base is added at a time.
- The fluor is cleaved off after imaging, so it emits no signal at the next cycle; only the newly added base (with its attached fluor) emits light.
Sequencing By Synthesis (SBS)
[Figure: primed template strand with the first base incorporated]
Cycle 1:
- Add sequencing reagents
- Remove unincorporated bases
- Detect signal/imaging
- Cleave off fluor and deblock
Cycle 2-n: add sequencing reagents and repeat
Features:
- All four labeled nucleotides in one reaction
- High accuracy
- Base-by-base sequencing
- No problems with homopolymer repeats
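The cycle described above can be sketched as a toy simulation. This is only an idealized illustration, not instrument software: the function name `sequence_by_synthesis` is invented here, and the base-to-dye mapping is a hypothetical assignment, since the slides name the four colors but not which base carries which. It models one cluster with no errors or phasing.

```python
# Watson-Crick pairing used to pick the nucleotide that gets incorporated.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Hypothetical dye assignment: the slides list green, orange, red and blue
# but do not say which base carries which color.
DYE = {"A": "green", "C": "blue", "G": "orange", "T": "red"}


def sequence_by_synthesis(template: str, cycles: int) -> tuple[str, list[str]]:
    """Read `cycles` bases off a template, one base per cycle.

    `template` is written 3'->5', so the growing strand reads 5'->3'
    from left to right.
    """
    read = []
    colors = []
    for base in template[:cycles]:
        # The 3' block guarantees exactly one nucleotide is added per cycle.
        incorporated = COMPLEMENT[base]
        # Imaging: the color of the newly added base identifies it.
        colors.append(DYE[incorporated])
        read.append(incorporated)
        # Cleaving the fluor and removing the 3' block would happen here;
        # in this toy model that is simply the next loop iteration.
    return "".join(read), colors


read, colors = sequence_by_synthesis("AAAGT", cycles=4)
print(read)    # one base called per cycle, including inside the AAA run
print(colors)
```

Because each cycle adds exactly one base, a run of identical bases is read over several separate cycles, which is the reason the slide lists no problems with homopolymer repeats.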
Ion Torrent PGM and Proton
First PostLight sequencing technology: instead of using light as an intermediary, the PGM creates a direct connection between the chemical and the digital worlds.
[Image labels: Ion PGM™ Sequencer; 4 Ion Protons: coming soon]
The Chip is the Machine
Uses semiconductor chips for sequencing.
- Ion PI chip: >165 million wells per chip; 8 to 10 Gb of data per run
- Ion PII chip: ~100 Gb of data in ~4 hours
Base Calling
When a nucleotide is incorporated into a strand of DNA, a hydrogen ion is released as a by-product. The hydrogen ion carries a charge, which the PGM's ion sensor detects; that signal is then called as a base.
Ion Torrent technology video.
Advantages and Current Limitations
Advantages:
- Low equipment cost
- Rapid run times: 3 to 4 hours
- Simple chemistry
Limitations:
- Homopolymer detection
- Error rates
- Slow to introduce newer chips: overpromise
- PGM and Proton: two separate pieces of sequencing equipment
- Library prep: emulsion PCR/new protocols
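The link between the ion-based base calling described earlier and the homopolymer limitation can be illustrated with a small, idealized model. It assumes nucleotides are flowed over the chip one type at a time in a fixed cyclic order; the specific order `TACG` and the function names are illustrative assumptions, and real signals are noisy and need normalization that is not modeled here.

```python
# Assumed cyclic flow order, for illustration only.
FLOW_ORDER = "TACG"


def flow_signals(read: str, n_flows: int) -> list[int]:
    """Ideal, noise-free signal per flow: the number of bases incorporated.

    `read` is the strand being synthesized, 5'->3'.
    """
    signals = []
    pos = 0
    for i in range(n_flows):
        nucleotide = FLOW_ORDER[i % len(FLOW_ORDER)]
        run = 0
        # A homopolymer run is incorporated all at once in a single flow,
        # releasing one hydrogen ion per base.
        while pos < len(read) and read[pos] == nucleotide:
            run += 1
            pos += 1
        signals.append(run)
    return signals


def call_bases(signals: list[float]) -> str:
    """Turn per-flow signal strengths back into bases by rounding."""
    calls = []
    for i, signal in enumerate(signals):
        nucleotide = FLOW_ORDER[i % len(FLOW_ORDER)]
        calls.append(nucleotide * round(signal))
    return "".join(calls)


ideal = flow_signals("TTTAC", n_flows=4)
print(ideal, call_bases(ideal))

# A modest measurement error on the first flow changes the called run length.
noisy = [2.4, 1.1, 0.9, 0.0]
print(noisy, call_bases(noisy))
```

In this model the length of a homopolymer must be inferred from signal magnitude rather than counted cycle by cycle, so small signal errors become insertions or deletions, which is consistent with homopolymer detection being listed as a limitation.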
PacBio RS Third generation sequencing
The Third Generation Sequencing Platform: PacBio RS
Pacific Biosciences has developed Single Molecule Real Time (SMRT™) DNA sequencing technology, implemented in the PacBio RS. This technology enables, for the first time, the observation of natural DNA synthesis by a DNA polymerase as it occurs. It delivers long reads at the single-molecule level and a fast time to result, enabling a new paradigm in genomic analysis.
Pacific Biosciences SMRT® Technology Technology Video
Key Applications for PacBio RS
- Targeted sequencing
  - SNP and structural variant detection
  - Repetitive regions
  - Full-length transcript profiling
- De novo assembly and genome finishing
  - Bacterial genomes
  - Fungal genomes
- Gap-captured sequencing
- Targeted captured sequencing
- Base modification detection
  - Methylation
  - DNA damage
** Projects at YCGA
[Image label: YCGA PacBio RS]
Comparisons Between PacBio RS and Illumina HiSeq
Feature | PacBio RS (Third generation) | Illumina HiSeq (Second generation)
Sequencing chemistry | Sequencing by synthesis (SBS), Single Molecule Real Time (SMRT) | Sequencing by synthesis (SBS)
Sequencing substrate | SMRT Cell made up of 150,000 ZMWs | Flow cell made up of 8 separate lanes
Data output per day | 1 to 2 billion/day at $1.5 per Mb | 60 billion/day at $0.06 per Mb
Read length | Average up to 5 Kb | 50 bp to 150 bp
Error rates | Raw: 10-15%. With 30x coverage: Q50 (< 0.01) | 0.5 to 1%
Sample library | SMRTbell template (single-strand circular DNA), 250 bp to 10 Kb insert | dsDNA with adaptors (175 bp to 1 Kb)
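The error rates in the comparison mix two notations: percentages and a Phred-style quality score (Q50). A minimal sketch of the standard Phred conversion, Q = -10 log10(p), puts them on one scale. The function names are invented for illustration; the conversion formula itself is the standard definition.

```python
import math


def phred_to_error(q: float) -> float:
    """Error probability for a Phred quality score Q."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Phred quality score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# Q50 consensus accuracy quoted for PacBio at 30x coverage.
print(f"Q50 -> error probability {phred_to_error(50):.0e}")

# Raw per-read error rates quoted in the comparison, as probabilities.
for label, p in [("PacBio raw, 15%", 0.15),
                 ("PacBio raw, 10%", 0.10),
                 ("Illumina, 1%", 0.01),
                 ("Illumina, 0.5%", 0.005)]:
    print(f"{label}: about Q{error_to_phred(p):.0f}")
```

Q50 corresponds to one error in 100,000 bases, or 0.001%, so the "< 0.01" next to it in the comparison is consistent if it is read as a percentage.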
Recent advances in nanopore sequencing
- Two types of nanopores: protein and synthetic (silicon nitride). Protein nanopores appear to be better at recognizing nucleotides.
- The rapid speed at which DNA strands pass through the tiny hole makes distinguishing bases more difficult. Currently an enzyme is used to control the rate.
- By shining a low-power green laser on a synthetic nanopore immersed in salt water, it is possible to manipulate DNA speed at will. As the current increases, positive ions drag water molecules in the direction opposite to the incoming DNA, acting as a brake and slowing its passage through the pore. As a result, nanoscale sensors in the pore would be able to read each nucleotide entering the pore more accurately.
- Using nanopores, long stretches of DNA can be zipped back and forth through the pore and read several times.
- Protein nanopores can also identify epigenetic changes.
Meller A. et al, Nat Biotech 2013
Advantages
- Nanopores offer a label-free, electrical, single-molecule DNA sequencing method
- No costly fluorescent labeling reagents
- No need for expensive optical hardware and sophisticated instrumentation to detect DNA bases
Performance/Limitations…..?
- First data was released in Feb 2012; new data has been slow to follow since then
- Very little data available for evaluation
- High error rates: >5%
The YCGA Laboratory at West Campus
[Photo: portion of the laboratory showing sequencing systems through the glass wall partition that separates the laboratory from the office and administrative area.]
- YCGA was established in January 2009 through generous funding support and the strong commitment of Yale University and the School of Medicine
- Located in a newly renovated building
- Approximately 7,000 sq ft of laboratory and ~4,000 sq ft of office space
- 23 staff
Sequencing Platforms at YCGA
YCGA has kept pace with cutting-edge sequencing technologies
- 11 Illumina HiSeqs (2000 and 2500)
- One PacBio RS
- One MiSeq
- Ion PGM™ Sequencer
Types of samples processed and runs of sequence read lengths carried out at YCGA in a typical month.
Need for strong R&D efforts for Next-Generation sequencing operation
- Optimization of sample preparation protocols for exome capture, which has decreased the cost of a single human exome from $8,000 in 2009 to the current price of ~$500 while improving the quality of the data.
- Development of a highly efficient protocol to extract and repair DNA from formalin-fixed, paraffin-embedded blocks for exome analysis.
- Improved protocols for gDNA-seq, RNA-seq, and ChIP-seq that show higher data complexity than traditional protocols, allow users to start with less material, and cost less.
- Continuous improvement of the various analysis pipelines.
Whole-Genome vs. Whole-Exome Sequencing
- 2.1M probes cover ~300,000 exons of 19,000 genes
- Total covered bases: 44.1 Mb
- Protein-coding genes (the exome) constitute 1% of the human genome but harbor 85% of disease-causing mutations
- Significantly cheaper than sequencing the entire genome
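The capture figures above can be cross-checked with simple arithmetic. The sketch below takes the probe, exon, gene and covered-base counts from the slide; the human genome size of 3,100 Mb is an assumption added for the comparison, and `capture_summary` is an illustrative name.

```python
# Assumed human genome size in Mb (not stated on the slide).
HUMAN_GENOME_MB = 3100.0

# Values quoted on the slide.
COVERED_MB = 44.1
PROBES = 2_100_000
EXONS = 300_000
GENES = 19_000


def capture_summary() -> dict[str, float]:
    """Derive per-exon and per-gene averages from the slide's totals."""
    return {
        "percent_of_genome": COVERED_MB / HUMAN_GENOME_MB * 100,
        "probes_per_exon": PROBES / EXONS,
        "bases_per_exon": COVERED_MB * 1_000_000 / EXONS,
        "exons_per_gene": EXONS / GENES,
    }


summary = capture_summary()
for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

With these numbers the capture target comes out at roughly 1.4% of the genome, in line with the approximately 1% quoted for protein-coding sequence, and that small target is what makes exome sequencing significantly cheaper than whole-genome sequencing.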
Scientific and economic impact of high throughput sequencing at Yale
List of select publications resulting from next-generation sequencing at YCGA
- Whole-exome sequencing identifies recessive WDR62 mutations in severe brain malformations. Bilguvar. Nature, v467, 2010
- A Novel miRNA Processing Pathway Independent of Dicer Requires Argonaute2 Activity. Cifuentes. Science, v328, 2010
- Mitotic recombination in ichthyosis causes reversion of dominant mutations in KRT10. Choate K. Science, v330, 2010
- Transcriptomic analysis of avian digits reveals conserved and derived digit identities in birds. Wang S. Nature, v477, 2011
- Transposon-mediated rewiring of gene regulatory networks contributed to the evolution of pregnancy in mammals. Lynch and Wagner. Nat Genet., v43, 2011
- K+ channel mutations in adrenal aldosterone-producing adenomas and hereditary hypertension. Choi M. Science, v331, 2011
- Recessive LAMC3 mutations cause malformations of occipital cortical development. Barak and Gunel. Nat Genet., v43, 2011
- Spatio-temporal transcriptome of the human brain. Kang and Sestan. Nature, v478, 2011
- Langerhans cells facilitate epithelial DNA damage and squamous cell carcinoma. Modi and Girardi. Science, v335, 2012
- Mutations in kelch-like 3 and cullin 3 cause hypertension and electrolyte abnormalities. Boyden et al. Nature, v482, 2012
- De novo point mutations, revealed by whole-exome sequencing, are strongly associated with Autism Spectrum Disorders. Sanders and State. Nature, v485, 2012
- Exome sequencing identifies recurrent somatic RAC1 mutations in melanoma. Krauthammer. Nat Genet., v44, 2012
- Genomic Analysis of Non-NF2 Meningiomas Reveals Mutations in TRAF7, KLF4, AKT1, and SMO. Clark V et al. Science, v339, 2013
- De novo mutations in histone-modifying genes in congenital heart disease. Zaidi and Lifton. Nature, v498, 2013
- Recessive mutations in DGKE cause atypical hemolytic-uremic syndrome. Lemaire and Lifton. Nat Genet., v45, 2013
- Somatic and germline CACNA1D calcium channel mutations in aldosterone-producing adenomas and primary aldosteronism. Scholl and Lifton. Nat Genet., v45, 2013
- The evolution of lineage-specific regulatory activities in the human embryonic limb. Cotney and Noonan. Cell, v154, 2013
- Mutations in DSTYK and dominant urinary tract malformations. Sanna-Cherchi and Gharavi. N Engl J Med., 2013
- Nanog, Pou5f1 and SoxB1 activate zygotic gene expression during the maternal-to-zygotic transition. Lee et al. Nature, 2013
- Co-expression networks implicate human mid-fetal deep cortical projection neurons in the pathogenesis of autism. Willsey and State. Cell, 2013
- CLP1 Founder Mutation Links tRNA Splicing and Maturation to Cerebellar Development and Neurodegeneration. Schaffer AE and Gleeson JG. Cell, v157, 2014
- Exome sequencing links corticospinal motor neuron disease to common neurodegenerative disorders. Novarino G and Gleeson JG. Science, v343, 2014
Impact of High Throughput Sequencing: Grant Funding (partial list)
Source | Amount | Duration
Mendelian center grant, NIH | $12M | 3y
Gilead cancer grant | $40M | 4y
Brain tumor gift | $12M | 4y
ARRA brain development (NIH) | $3M | 2y
ARRA kidney disease (NIH) | $2M | 2y
Simons autism sequencing | $4M | 3y
Brain transcriptome (NIH) | $10M | 2y
Congenital heart disease (NIH) | $5M | 4y
Pediatric Cardiac Genomic Consortium | $2M | 2y
Melanoma SPORE (NIH) | $12M | 5y
Biogen Inc. (PPMS) | $2M |
VA: Schizophrenia/Bipolar disorder | $12M |
Yale Comprehensive Cancer Center | $14M |
Total | $128M |
Use of genomics to tailor medical care to individuals based on their genetic makeup.
- Which treatment? Therapeutic choice
- What are my chances? Prognosis
- Which class of cancer? Classification
- Is it benign? Diagnosis
- How and why? Discovery
Discovery:
- Elucidation of the mechanism of cause
- Identification of cancer biomarkers
- Therapeutic targets
CLIA: The New Paradigm in Molecular Diagnostics
- YCGA is carrying out clinical diagnostic work in collaboration with Dr. Allen Bale
- Over 1,000 exomes have been analyzed for various disorders
- Conventional molecular testing: gene by gene
- Genomic testing: exome analysis
Sequencing a genome is simple; finding the cause of a disease is not
- The first clinical use of whole genome sequencing shows just how challenging it can be.
- Study of fraternal twins with a monogenic disorder
Genomes on prescription: Nature 2011
Bainbridge M, Sci Transl Med 2011
Acknowledgement
- Jim Noonan
- Yale University, School of Medicine and West Campus
- NHGRI: CMG
- YCGA staff
Data Analysis Overview
Primary and Secondary Analysis Overview
Analysis type: Sequencing; Primary Analysis; Secondary Analysis
Software: ICS/RTA
Outputs: Images/TIFF files; Intensities; Base Calling; Alignments and Variant Detection
Cluster Generation: Amplification
Template hybridization and initial extension
[Figure: grafted flow cell with P5 and P7 oligos]
- Template hybridization: >250-300 million single molecules hybridize to the lawn of primers
- Initial extension: 3' extension; single molecules are bound to the flow cell in a random pattern
- Denaturation: the original template is washed away
Cluster Generation: Amplification (n=35 cycles total)
Each cycle (1st cycle, 2nd cycle, and so on):
- Annealing: the single strand flips over to hybridize to adjacent oligos, forming a bridge
- Extension: the hybridized primer is extended by polymerases
- Denaturation: the double-stranded bridge is denatured
Result: two copies of covalently bound single-stranded templates
Cluster Generation: Linearization, blocking and sequencing primer hybridization
Steps following cluster amplification:
- P5 linearization: dsDNA bridges are denatured; complement strands are cleaved and washed away
- Block with ddNTPs: free 3' ends are blocked to prevent unwanted DNA priming
- Denaturation and sequencing primer hybridization: the sequencing primer is hybridized
Sequencing
1. Denaturation and hybridization
2. Sequencing first read
3. Denaturation and de-protection
4. Resynthesis of P5 strand (15 cycles)
5. P7 linearization
6. Block with ddNTPs
7. Denaturation and hybridization
8. Sequencing second read