3 Falling cost of sequencing DeWitt, Nat. Biotechnol. 2012
4 Sequencing human genomes
- 2001: The Human Genome (~ 3 Billion $)
- 2011: 1000 Genomes Project
- 2013 (?): Your Genome (~ $$)
5 Outline
- Overview of Next-Generation Sequencing (NGS)
- Applications
- Challenges
- Solutions
6 Sequencing Revolution (Metzker, Nat. Rev. Genet. 2010)
- Sanger sequencing: 100s of reactions… 10000s of base pairs…
- Next-Generation sequencing: Millions of reactions! Billions of base pairs!
7 High-throughput Sequencing
- 2009: 36 bp X 20M X 8 lanes = 6 Gbases
- 2013: 2 X 150 bp X 250M X 8 lanes = 600 Gbases
- 200 Human Genomes in 1 run!!!
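The yield figures on this slide follow from read length times reads per lane times number of lanes. A minimal sketch of that arithmetic is below; the function name is hypothetical, and the reading of "200 Human Genomes in 1 run" as 1X coverage of a ~3 Gbase genome is an assumption (the 3 billion bp figure comes from the Human Genome slide).

```python
def run_yield_gbases(read_length_bp, reads_per_lane, lanes, paired=False):
    """Bases produced by one run, in Gbases (1 Gbase = 1e9 bases)."""
    ends = 2 if paired else 1  # paired-end runs read each fragment twice
    return ends * read_length_bp * reads_per_lane * lanes / 1e9

# ~3 billion bp per human genome (figure taken from the Human Genome slide)
HUMAN_GENOME_GBASES = 3.0

# 2009: 36 bp single-end reads, 20M reads per lane, 8 lanes
yield_2009 = run_yield_gbases(36, 20_000_000, 8)

# 2013: 2 x 150 bp paired-end reads, 250M reads per lane, 8 lanes
yield_2013 = run_yield_gbases(150, 250_000_000, 8, paired=True)

# Assumption: "genomes per run" counted at 1X coverage
genomes_at_1x = yield_2013 / HUMAN_GENOME_GBASES

print(f"2009: {yield_2009:.2f} Gbases")
print(f"2013: {yield_2013:.0f} Gbases, {genomes_at_1x:.0f} genomes at 1X")
```

The 2009 run comes out at 5.76 Gbases, which the slide rounds to 6 Gbases.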
8 NGS Technology Comparison
Instrument: Pacbio | Ion Torrent | 454 | Illumina | SOLiD
Method: Single-molecule in real-time | Ion semiconductor | Pyrosequencing | Synthesis | Ligation
Read length: 3 kb average | 200 bp | 700 bp | 50 to 250 bp | 50+35 or bp
Error type: indel | substitution | A-T bias
Single-pass error rate (%): 13 | ~1 | ~0.1
Reads per run: 35000–75000 | up to 4M | 1M | up to 3.2G | 1.2 to 1.4G
Time per run: 30 minutes to 2 hours | 2 hours | 24 hours | 1 to 10 days | 1 to 2 weeks
Cost per 1 million bases (in US$): $2 | $1 | $10 | $0.05 to $0.15 | $0.13
Advantages: Longest read length. Fast. | Less expensive equipment. Fast. | Long read size. Fast. | High sequence yield, cost, accuracy. | Low cost per base.
Disadvantages: Low yield at high accuracy. Equipment can be very expensive. | Homopolymer errors. | Runs are expensive. Homopolymer errors. | Equipment can be very expensive. | Slower than other methods, read length, longevity of the platform.
9 Genome Canada
- > $915M investment and > $900M in co-funding
- 100s of large-scale genomics projects
- 5 Innovation centers
10 Outline
- Overview of Next-Generation Sequencing (NGS)
- Applications
- Challenges
- Solutions
11 Applications (I) De novo sequencing From the human genome… To all model organisms… To all relevant organisms (e.g. extreme genomes)… To “all” organisms?
12 Human Genome
- 3 Billion DNA base pairs (bp)
- Two human genomes are ~99.9% identical
- There are ~3M bp differences between you and me
- Some of these differences explain variation in:
  - Disease susceptibility
  - Drug metabolism
  - …
13 Applications (II) Genome re-sequencing
- Genetic disorders
- Cancer genome sequencing
- Map genomic structural variations across individuals
- Genealogy and migration
- Agricultural crops
- …
The Cancer Genome Atlas; 1000 Genomes Project
14 Exome sequencing for Mendelian disease
“… about one-half to one-third (~3,000) of all known or suspected Mendelian disorders (for example, cystic fibrosis and sickle cell anaemia) have been discovered. However, there is a substantial gap in our knowledge about the genes that cause many rare Mendelian phenotypes.”
“Accordingly, we can realistically look towards a future in which the genetic basis of all Mendelian traits is known, …”
18 Mutations in paediatric glioblastoma Jabado, Pfister and Majewski
19 Mutations in paediatric glioblastoma
Sequenced the exomes of 48 paediatric GBM samples, found:
- Somatic mutations in the H3.3-ATRX-DAXX chromatin remodelling pathway in 44% of tumours
- Recurrent mutations in H3F3A, which encodes the replication-independent histone 3 variant H3.3, in 31% of tumours
20 Applications (III) Quantitative biology of complex systems
- New high-throughput technologies in functional genomics: ChIP-Seq, RNA-Seq, ChIA-PET, RIP-Seq, …
- From single-gene measurements, to thousands of probes on arrays, to profiles covering all 3B bases of the genome
- Important systems: Stem cells, Cancer, Infectious diseases…
21 Outline
- Overview of Next-Generation Sequencing (NGS)
- Applications
- Challenges
- Solutions
22 High-throughput Sequencing
- 2009: 36 bp X 20M X 8 lanes = 6 Gbases
- 2013: 2 X 150 bp X 250M X 8 lanes = 600 Gbases
- 200 Human Genomes in 1 run!!!
23 Big Data 2013 2 X 10 TBytes 1 TBytes Image files Intensity files Reads + qualities
24 Big Data 2013
- Intensity files: 2 X 10 TBytes
- Reads + qualities: 1 TBytes
From: Alexandre Montpetit
Subject: news from Illumina
Date: 4 June, :15:16 PM EDT
To: Guillaume Bourque
From Mark Van Oene (Illumina VP of sales): in the coming year we should expect 2x more reads in half the time (and 2x longer). Does that cause a problem?
Alex
- 12 TBytes
- 240 TBytes
- 25 TB of raw data / month
- 300 TB of raw data / year
25 Large NGS project
Cancer project with whole genome data: 500 tumors vs 500 matched-normal
- Tumors: 500 X 3 lanes = 500 X 250 GB = 125 TB raw
- Matched-normal: 500 X 3 lanes = 500 X 250 GB = 125 TB raw
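The storage estimate for this project can be reproduced with a few lines. This is a rough sketch using the slide's figures (500 samples per arm, 3 lanes or about 250 GB of raw data per sample); the function name is hypothetical, decimal units (1 TB = 1000 GB) are assumed, and the combined total is derived here rather than stated on the slide.

```python
SAMPLES_PER_ARM = 500   # 500 tumors, 500 matched-normal samples
GB_PER_SAMPLE = 250     # 3 lanes of whole-genome data per sample

def raw_storage_tb(samples, gb_per_sample):
    """Raw data footprint in TB, assuming 1 TB = 1000 GB."""
    return samples * gb_per_sample / 1000

tumor_tb = raw_storage_tb(SAMPLES_PER_ARM, GB_PER_SAMPLE)
normal_tb = raw_storage_tb(SAMPLES_PER_ARM, GB_PER_SAMPLE)

# Derived figure (not on the slide): both arms together
project_tb = tumor_tb + normal_tb

print(f"Tumors: {tumor_tb:.0f} TB raw")
print(f"Matched-normal: {normal_tb:.0f} TB raw")
print(f"Whole project: {project_tb:.0f} TB raw")
```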
26 DNA bases sequenced at the Innovation Center
- 12 HiSeqs
- 72 trillion DNA bases!
- Or 800 genomes at 30X
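The "800 genomes at 30X" figure is the total base count divided by the bases needed to cover one genome 30 times. A small sketch of that conversion, assuming a ~3 billion bp human genome (from the Human Genome slide) and a hypothetical helper name:

```python
HUMAN_GENOME_BP = 3_000_000_000  # ~3 billion bp

def genomes_at_coverage(total_bases, coverage, genome_bp=HUMAN_GENOME_BP):
    """Number of genomes that total_bases can cover at the given depth."""
    return total_bases / (coverage * genome_bp)

total_bases = 72 * 10**12  # 72 trillion bases, per the slide
genomes_30x = genomes_at_coverage(total_bases, coverage=30)

print(f"{genomes_30x:.0f} genomes at 30X")
```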
28 Biomedical research is built on data integration Your data
29 Biomedical research is built on data integration 100X Your data
30 Challenges
- NGS instruments generate TBs of data
- NGS instruments are getting faster, cheaper and will increasingly be found in small research labs and hospitals
- Data sharing and integration is critical in biomedical research
- Sequencing data represents sensitive private data and is identifiable
31 Outline
- Overview of Next-Generation Sequencing (NGS)
- Applications
- Challenges
- Solutions
32 Nanuq software
Has tracked data and meta-data for more than:
- 2.6 million sample aliquots
- 20,500 reagents
- 17,000 plates
- 140,000 tubes
- Multiple platforms, technologies and workflows (sequencing, genotyping, microarray, etc.)
- 3,900 external users
34 Data center at the Innovation Center
- > 1200 cores
- > 2 PB disk
- > 5 PB tape
35 Need more!
- UdeS Mammouth – cores
- McGill Guillimin – cores
36 Data processing issues
- We have many different projects all needing space and processing.
- We want to use the Compute Canada clusters for scalability but also to facilitate data distribution (we have >800 users).
- This brings uniformity problems:
  - Different hardware and software setups
  - Different configurations
  - Etc.
37 Our strategy
- We wrote analysis pipelines to be easily configurable across clusters.
- Same code, one ini file to customize (we already have templates for 3 cluster sites)
- We install Linux modules, readable by all, on all these clusters so we know exactly what is available everywhere
- We also deploy common genomes across sites.
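The "same code, one ini file" idea can be illustrated with Python's standard `configparser`. This is only a sketch under stated assumptions, not the actual pipeline code: the section names, keys, paths, module name (`example-aligner/1.0`) and the `example_align` tool are all hypothetical, and the two templates stand in for per-cluster ini files.

```python
import configparser

# Hypothetical template for one cluster site
CLUSTER_A_INI = """
[DEFAULT]
submit_cmd = qsub
genome_dir = /shared/genomes

[align]
module = example-aligner/1.0
threads = 8
"""

# Hypothetical template for a second site: same keys, different values
CLUSTER_B_INI = """
[DEFAULT]
submit_cmd = sbatch
genome_dir = /project/genomes

[align]
module = example-aligner/1.0
threads = 16
"""

def load_config(ini_text):
    """Parse one cluster's ini template."""
    config = configparser.ConfigParser()
    config.read_string(ini_text)
    return config

def build_align_job(config, sample_fastq):
    """Return (submit command, job script) for a hypothetical alignment step.

    The pipeline code is identical on every cluster; only the values
    read from the ini file change.
    """
    step = config["align"]  # keys missing here fall back to [DEFAULT]
    script = (
        f"module load {step['module']} && "
        f"example_align --threads {step.getint('threads')} "
        f"--reference {step['genome_dir']}/reference.fa {sample_fastq}"
    )
    return step["submit_cmd"], script

for name, ini in [("cluster A", CLUSTER_A_INI), ("cluster B", CLUSTER_B_INI)]:
    submit_cmd, script = build_align_job(load_config(ini), "sample1.fastq")
    print(f"{name}: {submit_cmd} <<< {script}")
```

Keeping site-specific values (scheduler command, genome location, thread counts) in the ini file means that adding a new cluster only requires a new template, not changes to the pipeline code.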
39 Canadian Epigenetics, Environment and Health Research Consortium (CEEHRC)
40 PORTal for the Analysis of Genetics and Genomics Experiments (PORTAGGE)
41 Conclusions
- NGS offers a variety of technologies and numerous exciting applications
- Many areas of NGS data analysis are still under active development (e.g. RNA-Seq)
- A major challenge is to ensure sufficient compute and storage capacity so that more advanced analyses are not limited
- We need to work together to avoid duplicating effort in installing tools, and also to develop efficient ways to use HPC in biomedical research