3. Falling cost of sequencing (DeWitt, Nat. Biotechnol. 2012)
4. Sequencing human genomes
- 2001: The Human Genome (~3 billion $)
- 2011: 1000 Genomes Project
- 2013 (?): Your Genome (~ $$)
5. Outline
- Overview of Next-Generation Sequencing (NGS)
- Applications
- Challenges
- Solutions
6. Sequencing Revolution (Metzker, Nat. Rev. Genet. 2010)
- Sanger sequencing: 100s of reactions… 10000s of base pairs…
- Next-Generation sequencing: Millions of reactions! Billions of base pairs!
7. High-throughput Sequencing
- 2009: 36bp X 20M X 8 lanes = 6 Gbases
- 2013: 2 X 150bp X 250M X 8 lanes = 600 Gbases
- 200 Human Genomes in 1 run!!!
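The yields on this slide can be checked with a few lines of arithmetic. The sketch below assumes that "20M" and "250M" are reads per lane and that "2 X" means paired-end reads; the function name and the 3 billion bp genome size (taken from the Human Genome slide) are choices made for illustration.

```python
# Back-of-the-envelope yield per run, using only figures quoted in the deck.
HUMAN_GENOME_BP = 3_000_000_000  # ~3 billion bp (Human Genome slide)


def bases_per_run(read_length_bp, reads_per_lane, lanes, ends=1):
    """Total bases produced by one run (ends=2 for paired-end reads)."""
    return read_length_bp * ends * reads_per_lane * lanes


# Assumption: 20M and 250M are reads per lane.
run_2009 = bases_per_run(36, 20_000_000, 8)
run_2013 = bases_per_run(150, 250_000_000, 8, ends=2)

print(run_2009 / 1e9)              # 5.76 Gbases, rounded to 6 on the slide
print(run_2013 / 1e9)              # 600.0 Gbases
print(run_2013 / HUMAN_GENOME_BP)  # 200.0 genome lengths
```

Note that 200 genomes per run corresponds to 1X coverage of each; at the 30X depth quoted later in the deck, the same run covers about 6 to 7 genomes.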
8. NGS Technology Comparison

| Instrument | PacBio | Ion Torrent | 454 | Illumina | SOLiD |
| Method | Single-molecule in real-time | Ion semiconductor | Pyrosequencing | Synthesis | Ligation |
| Read length | 3 kb average | 200 bp | 700 bp | 50 to 250 bp | 50+35 or bp |
| Reads per run | 35000–75000 | up to 4M | 1M | up to 3.2G | 1.2 to 1.4G |
| Time per run | 30 minutes to 2 hours | 2 hours | 24 hours | 1 to 10 days | 1 to 2 weeks |
| Cost per 1 million bases (US$) | $2 | $1 | $10 | $0.05 to $0.15 | $0.13 |
| Advantages | Longest read length. Fast. | Less expensive equipment. Fast. | Long read size. Fast. | High sequence yield, cost, accuracy. | Low cost per base. |
| Disadvantages | Low yield at high accuracy. Equipment can be very expensive. | Homopolymer errors. | Runs are expensive. Homopolymer errors. | Equipment can be very expensive. | Slower than other methods, read length, longevity of the platform. |

Error type (left to right): indel; substitution; A-T bias
Single-pass error rate % (left to right): 13; ~1; ~0.1
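To put the cost-per-megabase row in perspective, it can be scaled up to a whole human genome. This is a rough sketch only: it assumes a 3,000 Mb genome and 30X coverage (both figures appear elsewhere in the deck), and it counts nothing but the per-base sequencing cost from the table, ignoring library preparation, labour, instruments and analysis. The dictionary and function names are illustrative.

```python
# Rough sequencing-only cost of one human genome, from the table's cost row.
GENOME_MB = 3_000  # ~3 billion bp expressed in megabases
COVERAGE = 30      # 30X, the depth quoted on the Innovation Center slide

# (low, high) cost in US$ per 1 million bases, copied from the table.
COST_PER_MB_USD = {
    "PacBio": (2.00, 2.00),
    "Ion Torrent": (1.00, 1.00),
    "454": (10.00, 10.00),
    "Illumina": (0.05, 0.15),
    "SOLiD": (0.13, 0.13),
}


def sequencing_cost_usd(cost_per_mb, genome_mb=GENOME_MB, coverage=COVERAGE):
    """Cost of sequencing one genome at the given coverage."""
    return cost_per_mb * genome_mb * coverage


for instrument, (low, high) in COST_PER_MB_USD.items():
    low_cost = sequencing_cost_usd(low)
    high_cost = sequencing_cost_usd(high)
    print(f"{instrument:12s} ${low_cost:>10,.0f} to ${high_cost:>10,.0f}")
```

Under these assumptions Illumina comes to roughly $4,500 to $13,500 per 30X genome, against about $900,000 on 454, which is why the short-read platforms dominate whole-genome projects.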
9. Genome Canada
- > $915M investment and > $900M in co-funding
- 100s of large-scale genomics projects
- 5 Innovation centers
11. Applications (I): De novo sequencing
- From the human genome…
- To all model organisms…
- To all relevant organisms (e.g. extreme genomes)…
- To “all” organisms?
12. Human Genome
- 3 billion DNA base pairs (bp)
- Two human genomes are ~99.9% identical
- There are ~3M bp differences between you and me
- Some of these differences explain variation in disease susceptibility, drug metabolism, …
13. Applications (II): Genome re-sequencing
- Genetic disorders
- Cancer genome sequencing
- Map genomic structural variations across individuals
- Genealogy and migration
- Agricultural crops
- …
The Cancer Genome Atlas; 1000 Genomes Project
14. Exome sequencing for Mendelian disease
“… about one-half to one-third (~3,000) of all known or suspected Mendelian disorders (for example, cystic fibrosis and sickle cell anaemia) have been discovered. However, there is a substantial gap in our knowledge about the genes that cause many rare Mendelian phenotypes.”
“Accordingly, we can realistically look towards a future in which the genetic basis of all Mendelian traits is known, …”
18. Mutations in paediatric glioblastoma (Jabado, Pfister and Majewski)
19. Mutations in paediatric glioblastoma
Sequenced the exomes of 48 paediatric GBM samples and found:
- Somatic mutations in the H3.3-ATRX-DAXX chromatin remodelling pathway in 44% of tumours
- Recurrent mutations in H3F3A, which encodes the replication-independent histone 3 variant H3.3, in 31% of tumours
20. Applications (III): Quantitative biology of complex systems
- New high-throughput technologies in functional genomics: ChIP-Seq, RNA-Seq, ChIA-PET, RIP-Seq, …
- From single-gene measurements, to thousands of probes on arrays, to profiles covering all 3B bases of the genome
- Important systems: stem cells, cancer, infectious diseases…
23. Big Data (2013)
- Image files
- Intensity files: 2 X 10 TBytes
- Reads + qualities: 1 TBytes
24. Big Data (2013)
- Intensity files: 2 X 10 TBytes
- Reads + qualities: 1 TBytes
- 12 TBytes; 240 TBytes
- 25 TB of raw data / month; 300 TB of raw data / year

From: Alexandre Montpetit
Subject: news from Illumina
Date: 4 June, :15:16 PM EDT
To: Guillaume Bourque

From Mark Van Oene (Illumina VP of sales): over the next year we should expect 2x more reads in 2x less time (and 2x longer). Does that cause a problem?
Alex
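The question in the email can be made concrete with a quick projection. The sketch below starts from the 25 TB of raw data per month quoted on the slide and applies the three factors from the email. It assumes, purely for illustration, that raw data volume scales linearly with the number of bases sequenced; the function names are hypothetical.

```python
# Projecting raw data volume from the slide's figures and the email's factors.
RAW_TB_PER_MONTH = 25  # from the slide


def tb_per_year(tb_per_month):
    """Yearly volume at a constant monthly rate."""
    return tb_per_month * 12


def projected_tb_per_month(tb_per_month, reads_factor=2, length_factor=2, speedup=2):
    """Monthly volume if each run yields more, longer reads and finishes sooner.

    Assumption: raw data size grows linearly with bases sequenced.
    """
    return tb_per_month * reads_factor * length_factor * speedup


current_year = tb_per_year(RAW_TB_PER_MONTH)
future_year = tb_per_year(projected_tb_per_month(RAW_TB_PER_MONTH))

print(current_year)  # 300 TB / year, matching the slide
print(future_year)   # 2400 TB / year under the stated assumption
```

Twice the reads, twice the length and half the run time compound to an eightfold increase in data rate, which is the problem the email alludes to.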
25. Large NGS project
Cancer project with whole genome data: 500 tumors vs 500 matched-normal
- Tumors: 500 X 3 lanes = 500 X 250GB = 125 TB raw
- Matched-normal: 500 X 3 lanes = 500 X 250GB = 125 TB raw
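The storage footprint of such a project follows directly from the per-sample figure. A minimal sketch, assuming decimal units (1 TB = 1000 GB) and reusing the 25 TB per month raw data rate quoted on the Big Data slide; the function name is illustrative.

```python
# Raw storage needed for a tumor / matched-normal whole-genome project.
GB_PER_TB = 1000           # assumption: decimal units
SAMPLES_PER_ARM = 500      # 500 tumors, 500 matched-normal
GB_PER_SAMPLE = 250        # 3 lanes per sample
CENTER_TB_PER_MONTH = 25   # raw data rate from the Big Data slide


def arm_storage_tb(samples, gb_per_sample):
    """Raw storage for one arm of the study, in TB."""
    return samples * gb_per_sample / GB_PER_TB


per_arm = arm_storage_tb(SAMPLES_PER_ARM, GB_PER_SAMPLE)
project_total = 2 * per_arm
months_of_output = project_total / CENTER_TB_PER_MONTH

print(per_arm)           # 125.0 TB per arm
print(project_total)     # 250.0 TB for tumors plus matched-normal
print(months_of_output)  # 10.0 months of raw output at 25 TB / month
```

A single project of this size is therefore comparable to most of a year of raw output at the quoted rate.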
26. DNA bases sequenced at the Innovation Center
- 12 HiSeqs
- 72 trillion DNA bases, or 800 genomes at 30X
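The equivalence between 72 trillion bases and 800 genomes at 30X can be verified with integer arithmetic, assuming the 3 billion bp genome size given on the Human Genome slide. The function name is illustrative.

```python
# Checking "72 trillion bases, or 800 genomes at 30X".
TOTAL_BASES = 72 * 10**12
HUMAN_GENOME_BP = 3 * 10**9  # ~3 billion bp (Human Genome slide)
INSTRUMENTS = 12             # 12 HiSeqs


def genome_equivalents(total_bases, coverage, genome_bp=HUMAN_GENOME_BP):
    """Number of genomes that total_bases covers at the given depth."""
    return total_bases // (genome_bp * coverage)


print(genome_equivalents(TOTAL_BASES, 30))  # 800
print(TOTAL_BASES // INSTRUMENTS)           # 6000000000000 bases per instrument
```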
28. Biomedical research is built on data integration: Your data
29. Biomedical research is built on data integration: 100X; Your data
30. Challenges
- NGS instruments generate TBs of data
- NGS instruments are getting faster and cheaper, and will increasingly be found in small research labs and hospitals
- Data sharing and integration are critical in biomedical research
- Sequencing data is sensitive, private and identifiable
32. Nanuq software
Has tracked data and meta-data for more than:
- 2.6 million sample aliquots
- 20,500 reagents
- 17,000 plates
- 140,000 tubes
- Multiple platforms, technologies and workflows (sequencing, genotyping, microarray, etc.)
- 3,900 external users
36. Data processing issues
- We have many different projects, all needing space and processing.
- We want to use the Compute Canada clusters for scalability but also to facilitate data distribution (we have >800 users).
- This brings uniformity problems:
  - Different hardware and software setups
  - Different configurations
  - Etc.
37. Our strategy
- We wrote analysis pipelines that are easily configurable across clusters.
- Same code, one ini file to customize (we already have templates for 3 cluster sites).
- We install Linux modules, readable by all, on all these clusters so we know exactly what is available everywhere.
- We also deploy common genomes across sites.
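The "same code, one ini file" idea can be sketched with Python's standard `configparser`. Everything specific here is hypothetical and chosen only for illustration: the section and key names, the paths, the `example-aligner` tool and its `--threads` and `--genome-dir` flags. The deck does not show the actual pipeline code or ini templates; only `module load` is a real Environment Modules command.

```python
import configparser

# Hypothetical ini template for one cluster site (names and paths are made up).
SITE_INI = """\
[DEFAULT]
genome_dir = /shared/genomes

[align]
module = example-aligner/1.0
threads = 8
"""


def load_site_config(ini_text):
    """Parse a site-specific ini file given as text."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return parser


def build_step_command(config, step, tool_command):
    """Build the shell command for one pipeline step from the site config."""
    section = config[step]  # keys from [DEFAULT] are visible in every section
    return (
        f"module load {section['module']} && "
        f"{tool_command} --threads {section['threads']} "
        f"--genome-dir {section['genome_dir']}"
    )


config = load_site_config(SITE_INI)
print(build_step_command(config, "align", "example-aligner"))
```

Moving to another cluster would then mean swapping the ini text (a different genome path, module version or thread count) while `build_step_command` stays untouched.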
39. Canadian Epigenetics, Environment and Health Research Consortium (CEEHRC)
40. PORTal for the Analysis of Genetics and Genomics Experiments (PORTAGGE)
41. Conclusions
- NGS offers a variety of technologies and numerous exciting applications.
- Many areas of NGS data analysis are still under active development (e.g. RNA-Seq).
- A major challenge is to ensure that compute and storage capacity is sufficient, so that it does not limit more advanced analyses.
- We need to work together, both to avoid duplicating effort when installing tools and to develop efficient ways to use HPC in biomedical research.