The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University.


1 The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University June 5th, 2013

2 Next-generation sequencing (NGS) 2 Stein, Genome Biol. 2010

3 Falling cost of sequencing 3 DeWitt, Nat. Biotechnol. 2012

4 Sequencing human genomes 1000 Genomes Project ~ $ The Human Genome ~ 3 Billion $ Your Genome $ (?)

5 Outline: Overview of Next-Generation Sequencing (NGS); Applications; Challenges; Solutions

6 Sequencing Revolution Sanger sequencing / Next-Generation sequencing Metzker, Nat. Rev. Genet s of reactions… 10000s of base pairs… Millions of reactions! Billions of base pairs!

7 High-throughput Sequencing 6 Gbases bp X 20M X 8 lanes Gbases 2 X 150bp X 250M X 8 lanes 200 Human Genomes in 1 run!!!
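The yield of the larger run on this slide follows from read length × reads per lane × lanes. A minimal Python sketch of that arithmetic, assuming paired-end reads (the "2 X" on the slide) and a human genome of about 3 billion bp as stated on the Human Genome slide. The function name is illustrative only.

```python
def run_yield_bases(read_length_bp, reads_per_lane, lanes, paired=True):
    """Total bases produced by one sequencing run."""
    ends = 2 if paired else 1
    return ends * read_length_bp * reads_per_lane * lanes

HUMAN_GENOME_BP = 3_000_000_000  # assumption: ~3 billion bp, as on the Human Genome slide

# 2 x 150 bp reads, 250M read pairs per lane, 8 lanes (numbers from the slide)
total = run_yield_bases(150, 250_000_000, 8)
genome_equivalents = total // HUMAN_GENOME_BP  # at 1X coverage

print(total / 1e9, "Gbases per run")
print(genome_equivalents, "human genome equivalents at 1X")
```

Under these assumptions the run yields 600 Gbases, which matches the "200 Human Genomes in 1 run" figure only at 1X coverage. At a typical analysis depth the number of genomes per run is correspondingly smaller.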

8 NGS Technology Comparison
Instrument: PacBio | Ion Torrent | 454 | Illumina | SOLiD
Method: Single-molecule in real-time | Ion semiconductor | Pyrosequencing | Synthesis | Ligation
Read length: 3 kb average | 200 bp | 700 bp | 50 to 250 bp | or bp
Error type: indel | substitution | A-T bias
Single-pass error rate (%): 13 | ~1 | ~0.1
Reads per run: 35,000–75,000 | up to 4M | 1M | up to 3.2G | 1.2 to 1.4G
Time per run: 30 minutes to 2 hours | 2 hours | 24 hours | 1 to 10 days | 1 to 2 weeks
Cost per 1 million bases (US$): $2 | $1 | $10 | $0.05 to $0.15 | $0.13
Advantages: Longest read length. Fast. | Less expensive equipment. Fast. | Long read size. Fast. | High sequence yield, cost, accuracy. | Low cost per base.
Disadvantages: Low yield at high accuracy. Equipment can be very expensive. | Homopolymer errors. | Runs are expensive. Homopolymer errors. | Equipment can be very expensive. | Slower than other methods, read length, longevity of the platform.
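To put the cost-per-megabase row in perspective, the sketch below converts it into a rough raw sequencing cost for one human genome. It assumes a 3,000 Mb genome and 30X coverage (the depth quoted later in the talk) and ignores library preparation, labour and analysis, none of which the slide prices. The function and variable names are illustrative.

```python
GENOME_MB = 3_000  # assumption: ~3 billion bp = 3,000 megabases
COVERAGE = 30      # assumption: 30X depth, the figure quoted later in the talk

# Cost per 1 million bases in US$, copied from the table as (low, high)
cost_per_mb = {
    "PacBio": (2.00, 2.00),
    "Ion Torrent": (1.00, 1.00),
    "454": (10.00, 10.00),
    "Illumina": (0.05, 0.15),
    "SOLiD": (0.13, 0.13),
}

def genome_cost(cost_mb, coverage=COVERAGE, genome_mb=GENOME_MB):
    """Raw sequencing cost in US$ for one genome at the given coverage."""
    return cost_mb * genome_mb * coverage

for platform, (low, high) in cost_per_mb.items():
    print(f"{platform}: ${genome_cost(low):,.0f} to ${genome_cost(high):,.0f}")
```

This is only an order-of-magnitude comparison: read length and error rate also determine whether a platform is suitable for whole-genome work at all.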

9 Genome Canada 9 > $915M investment and > $900M in co-funding 100s Large-scale genomics projects 5 Innovation centers

10 Outline: Overview of Next-Generation Sequencing (NGS); Applications; Challenges; Solutions

11 Applications (I) De novo sequencing – From the human genome… To all model organisms… To all relevant organisms (e.g. extreme genomes)… To “all” organisms? 11

12 Human Genome 12 3 Billion DNA base pairs (bp) Two human genomes are ~99.9% identical There are about ~3M bp differences between you and me Some of these differences explain variation in: – Disease susceptibility – Differences in drug metabolism – …

13 Applications (II) Genome re-sequencing – Genetic disorders – Cancer genome sequencing – Map genomic structural variations across individuals – Genealogy and migration – Agricultural crops – … Examples: 1000 Genomes Project, The Cancer Genome Atlas

14 Exome sequencing for Mendelian disease 14 “… about one-half to one-third (~3,000) of all known or suspected Mendelian disorders (for example, cystic fibrosis and sickle cell anaemia) have been discovered. However, there is a substantial gap in our knowledge about the genes that cause many rare Mendelian phenotypes.” “Accordingly, we can realistically look towards a future in which the genetic basis of all Mendelian traits is known, …”

15 Exome sequencing 15

16 Cancer genome sequencing 16 Can obtain a full catalogue of mutations

17 Michael Stromberg, bioinformatics.ca

18 Mutations in paediatric glioblastoma 18 Jabado, Pfister and Majewski

19 Mutations in paediatric glioblastoma 19 Sequenced the exomes of 48 paediatric GBM samples, found: Somatic mutations in the H3.3-ATRX-DAXX chromatin remodelling pathway in 44% of tumours. Recurrent mutations in H3F3A, which encodes the replication-independent histone 3 variant H3.3, in 31% of tumours.

20 Applications (III) Quantitative biology of complex systems – New high-throughput technologies in functional genomics: ChIP-Seq, RNA-Seq, ChIA-PET, RIP-Seq, … – From single-gene measurements, to thousands of probes on arrays, to profiles covering all 3B bases of the genome – Important systems: Stem cells, Cancer, Infectious diseases… 20

21 Outline: Overview of Next-Generation Sequencing (NGS); Applications; Challenges; Solutions

22 High-throughput Sequencing 6 Gbases bp X 20M X 8 lanes Gbases 2 X 150bp X 250M X 8 lanes 200 Human Genomes in 1 run!!!

23 Big Data 1 TBytes X 10 TBytes Intensity files, Reads + qualities 70 TBytes Image files

24 Big Data 1 TBytes X 10 TBytes 12 TBytes 240 TBytes Intensity files, Reads + qualities 25 TB of raw data / month 300 TB of raw data / year From: Alexandre Montpetit Subject: news from Illumina Date: 4 June, :15:16 PM EDT To: Guillaume Bourque From Mark Van Oene (Illumina VP of sales): over the next year we should expect 2x more reads in half the time (and reads that are 2x longer). Does that cause a problem? Alex
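The monthly and yearly raw-data figures, and what the vendor forecast in the email could mean for them, can be sketched as follows. The linear scaling of data volume with bases sequenced per unit time is an assumption made here for illustration. The slides do not state it.

```python
RAW_TB_PER_MONTH = 25  # from the slide

raw_tb_per_year = RAW_TB_PER_MONTH * 12

# Vendor forecast quoted in the email: 2x more reads, 2x longer, in half the time.
reads_factor, length_factor, time_factor = 2, 2, 0.5

# Assumption (not from the slides): raw data volume scales linearly
# with the number of bases sequenced per unit time.
throughput_factor = reads_factor * length_factor / time_factor
projected_tb_per_year = raw_tb_per_year * throughput_factor

print(raw_tb_per_year, "TB of raw data per year today")
print(projected_tb_per_year, "TB per year if the forecast holds")
```

Under that assumption the three factors compound to an eightfold increase in data per unit time, which is why a seemingly modest "2x" announcement is a storage concern.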

25 Large NGS project Cancer project with whole genome data: 500 tumors vs 500 matched-normal. Each group: 500 X 3 lanes = 500 X 250GB = 125 TB raw
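The storage estimate for this project can be checked with a few lines of Python. The sketch uses decimal units (1 TB = 1000 GB), which is what makes 500 X 250 GB come out at the slide's 125 TB. The combined total for both groups is an inference, not a number shown on the slide.

```python
GB_PER_SAMPLE = 250     # 3 lanes per sample, from the slide
SAMPLES_PER_GROUP = 500  # 500 tumors and 500 matched normals, from the slide

# Decimal units: 1 TB = 1000 GB
tb_per_group = GB_PER_SAMPLE * SAMPLES_PER_GROUP / 1000

# Inference (not on the slide): tumors plus matched normals together
tb_total = 2 * tb_per_group

print(tb_per_group, "TB raw per group")
print(tb_total, "TB raw for tumors and matched normals combined")
```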

26 DNA bases sequenced at the Innovation Center 12 HiSeqs: 72 trillion DNA bases! Or 800 genomes at 30X
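The equivalence between total bases and genomes at 30X can be verified directly, assuming a human genome of about 3 billion bp as on the Human Genome slide:

```python
TOTAL_BASES = 72_000_000_000_000  # 72 trillion bases, from the slide
GENOME_BP = 3_000_000_000         # assumption: ~3 billion bp per human genome
COVERAGE = 30                     # 30X, from the slide

bases_per_genome = GENOME_BP * COVERAGE
genomes_at_30x = TOTAL_BASES // bases_per_genome

print(genomes_at_30x, "genomes at 30X")
```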

27 27 adventure.nationalgeographic.com

28 Biomedical research is built on data integration Your data

29 Biomedical research is built on data integration 100X Your data

30 Challenges NGS instruments generate TBs of data. NGS instruments are getting faster and cheaper, and will increasingly be found in small research labs and hospitals. Data sharing and integration are critical in biomedical research. Sequencing data is sensitive, private and identifiable.

31 Outline: Overview of Next-Generation Sequencing (NGS); Applications; Challenges; Solutions

32 Nanuq software Has tracked data and meta-data for more than: 2.6 million sample aliquots, 20,500 reagents, 17,000 plates, 140,000 tubes, Multiple platforms, technologies and workflows (sequencing, genotyping, microarray, etc.), 3,900 external users

33 Standardized analysis pipelines ChIP-Seq: Analysis report; RNA-Seq: Analysis report; Methylation: Analysis report; …

34 Data center at the Innovation Center 34 > 1200 cores > 2 PB disk > 5 PB tape

35 Need more! 35 McGill Guillimin – cores UdeS Mammouth – cores

36 Data processing issues We have many different projects, all needing storage space and processing. We want to use the Compute Canada clusters for scalability but also to facilitate data distribution (we have >800 users). This brings uniformity problems: – Different setups (hardware and software) – Different configurations – Etc.

37 Our strategy We wrote analysis pipelines to be easily configurable across clusters: same code, one ini file to customize (we already have templates for 3 cluster sites). We install Linux modules, readable by all, on all these clusters so we know exactly what is available everywhere. We also deploy common genomes across sites.
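The "same code, one ini file" idea can be sketched with Python's standard `configparser`. This is not the Innovation Center's actual pipeline code: the cluster names, paths, keys, module name and the `align` command below are all hypothetical, chosen only to show how one code path can serve several sites.

```python
import configparser

# Hypothetical per-cluster settings; every section, key and path is illustrative.
CLUSTER_INI = """
[DEFAULT]
genome_dir = /shared/genomes
aligner_module = aligner/1.0

[cluster_a]
scheduler = pbs
scratch_dir = /scratch/cluster_a
cores_per_node = 12

[cluster_b]
scheduler = slurm
scratch_dir = /lustre/cluster_b
cores_per_node = 24
genome_dir = /project/common/genomes
"""

def load_cluster_config(ini_text, cluster):
    """Return the settings for one cluster, with DEFAULT values filled in."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return dict(parser[cluster])

def alignment_command(sample, settings):
    """Build the same pipeline step for any cluster from its settings.

    'align' is a placeholder command, not a real tool.
    """
    return (
        f"module load {settings['aligner_module']} && "
        f"align --threads {settings['cores_per_node']} "
        f"--reference {settings['genome_dir']}/reference.fa "
        f"--tmp {settings['scratch_dir']} {sample}.fastq"
    )

for name in ("cluster_a", "cluster_b"):
    print(alignment_command("sample01", load_cluster_config(CLUSTER_INI, name)))
```

Keeping site differences in the `[DEFAULT]` section plus per-cluster overrides means that adding a new site is a matter of adding one section, while the pipeline code itself stays unchanged.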

38 Usage on Compute Canada 38

39 39 Canadian Epigenetics, Environment and Health Research Consortium (CEEHRC) $1.5M ( )

40 PORTal for the Analysis of Genetics and Genomics Experiments (PORTAGGE) 40

41 Conclusions NGS offers a variety of technologies and numerous exciting applications. Many areas of NGS data analysis are still under active development (e.g. RNA-Seq). A major challenge is to ensure that compute and storage capacity are sufficient, so that they do not limit more advanced analyses. We need to work together to avoid duplicating effort in installing tools, and also to develop efficient ways to use HPC in biomedical research.

42 Acknowledgements IT team Terrance Mcquilkin Marc-André Labonté Genevieve Dancausse Andras Frankel Alexandru Guja Development team Nathalie Émond David Bujold Francois Cantin Catherine Côté Burak Demirtas Daniel Guertin Louis Dumond Joseph Francois Korbuly Marc Michaud Thuong Ngo Analysis team Louis Letourneau Mathieu Bourgey Maxime Caron Gary Lévesque Robert Eveleigh Francois Lefebvre Johanna Sandoval Pascale Marquis EDCC team David Morais (UdeS) Carol Gauthier (UdeS) Bryan Caron (McGill) Alain Veilleux (UdeS) ME Rousseau (McGill)

43 Questions? 43

