The Data Tsunami in Biomedical Research

1 The Data Tsunami in Biomedical Research
Guillaume Bourque
McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University
June 5th, 2013

2 Next-generation sequencing (NGS)
Stein, Genome Biol. 2010

3 Falling cost of sequencing
DeWitt, Nat. Biotechnol. 2012

4 Sequencing human genomes
2001: The Human Genome, ~ 3 Billion $
2011: 1000 Genomes Project, ~ $
2013 (?): Your Genome, $

5 Outline
Overview of Next-Generation Sequencing (NGS)
Applications
Challenges
Solutions

6 Sequencing Revolution
Sanger sequencing: 100s of reactions… 10000s of base pairs…
Next-Generation sequencing: Millions of reactions! Billions of base pairs!
Metzker, Nat. Rev. Genet. 2010

7 High-throughput Sequencing
2009: 36 bp X 20M X 8 lanes ≈ 6 Gbases
2013: 2 X 150 bp X 250M X 8 lanes = 600 Gbases
200 Human Genomes in 1 run!!!
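The yield figures on this slide follow from multiplying read length, reads per lane and lane count. A minimal sketch of that arithmetic, assuming "20M" and "250M" are reads per lane and "2 X" means paired-end reads; the function name is illustrative, not from the talk:

```python
# Back-of-the-envelope run yield, using only the figures quoted on the slide.
HUMAN_GENOME_BP = 3_000_000_000  # ~3 billion bp, as stated on the Human Genome slide


def run_yield_bases(read_length_bp, reads_per_lane, lanes, reads_per_fragment=1):
    """Total bases per run: read length x reads per lane x lanes (x2 if paired-end)."""
    return read_length_bp * reads_per_lane * lanes * reads_per_fragment


yield_2009 = run_yield_bases(36, 20_000_000, 8)        # single-end, 36 bp
yield_2013 = run_yield_bases(150, 250_000_000, 8, 2)   # paired-end, 2 x 150 bp

print(f"2009: {yield_2009 / 1e9:.2f} Gbases")
print(f"2013: {yield_2013 / 1e9:.0f} Gbases")
print(f"2013 run: {yield_2013 / HUMAN_GENOME_BP:.0f} genome-equivalents at 1X")
```

The 2009 product is 5.76 Gbases, which the slide rounds to 6. The "200 human genomes" figure is 600 Gbases divided by a 3 Gbase genome, so it counts genome-equivalents at 1X rather than genomes sequenced at analysis-grade depth.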

8 NGS Technology Comparison
Instruments: PacBio, Ion Torrent, 454, Illumina, SOLiD
Method: PacBio: single-molecule in real-time; Ion Torrent: ion semiconductor; 454: pyrosequencing; Illumina: sequencing by synthesis; SOLiD: ligation
Read length: PacBio: 3 kb average; Ion Torrent: 200 bp; 454: 700 bp; Illumina: 50 to 250 bp; SOLiD: 50+35 or bp
Error type: indel, substitution, A-T bias
Single-pass error rate %: 13, ~1, ~0.1
Reads per run: PacBio: 35000–75000; Ion Torrent: up to 4M; 454: 1M; Illumina: up to 3.2G; SOLiD: 1.2 to 1.4G
Time per run: PacBio: 30 minutes to 2 hours; Ion Torrent: 2 hours; 454: 24 hours; Illumina: 1 to 10 days; SOLiD: 1 to 2 weeks
Cost per 1 million bases (in US$): PacBio: $2; Ion Torrent: $1; 454: $10; Illumina: $0.05 to $0.15; SOLiD: $0.13
Advantages: PacBio: longest read length, fast; Ion Torrent: less expensive equipment, fast; 454: long read size, fast; Illumina: high sequence yield, cost, accuracy; SOLiD: low cost per base
Disadvantages: PacBio: low yield at high accuracy, equipment can be very expensive; Ion Torrent: homopolymer errors; 454: runs are expensive, homopolymer errors; Illumina: equipment can be very expensive; SOLiD: slower than other methods, read length, longevity of the platform
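To put the cost-per-million-bases row in perspective, the sketch below scales it up to one human genome. It assumes a ~3 billion bp genome and 30X depth (both figures appear elsewhere in the talk, but combining them with this table is an illustration, not something the slide does), and it assumes the five cost values map to the instruments in column order:

```python
GENOME_MB = 3_000   # ~3 billion bp expressed in millions of bases
COVERAGE = 30       # assumption: 30X sequencing depth

# (low, high) cost in US$ per 1 million bases, as listed in the table.
cost_per_mb_usd = {
    "PacBio": (2.00, 2.00),
    "Ion Torrent": (1.00, 1.00),
    "454": (10.00, 10.00),
    "Illumina": (0.05, 0.15),
    "SOLiD": (0.13, 0.13),
}


def genome_cost_usd(cost_per_mb, genome_mb=GENOME_MB, coverage=COVERAGE):
    """Reagent-style cost of sequencing one genome at the given depth."""
    return cost_per_mb * genome_mb * coverage


for name, (low, high) in cost_per_mb_usd.items():
    low_cost, high_cost = genome_cost_usd(low), genome_cost_usd(high)
    if low == high:
        print(f"{name}: ~${low_cost:,.0f}")
    else:
        print(f"{name}: ~${low_cost:,.0f} to ${high_cost:,.0f}")
```

This ignores library preparation, instrument amortization and analysis, so it is only a rough lower bound on what a 30X genome would cost on each platform.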

9 Genome Canada
> $915M investment and > $900M in co-funding
100s of large-scale genomics projects
5 Innovation centers

10 Outline
Overview of Next-Generation Sequencing (NGS)
Applications
Challenges
Solutions

11 Applications (I) De novo sequencing
From the human genome…
To all model organisms…
To all relevant organisms (e.g. extreme genomes)…
To “all” organisms?

12 Human Genome
3 Billion DNA base pairs (bp)
Two human genomes are ~99.9% identical
There are ~3M bp differences between you and me (0.1% of 3 billion bp)
Some of these differences explain variation in:
Disease susceptibility
Drug metabolism

13 Applications (II) Genome re-sequencing
Genetic disorders
Cancer genome sequencing
Map genomic structural variations across individuals
Genealogy and migration
Agricultural crops
The Cancer Genome Atlas
1000 Genomes Project

14 Exome sequencing for Mendelian disease
“… about one-half to one-third (~3,000) of all known or suspected Mendelian disorders (for example, cystic fibrosis and sickle cell anaemia) have been discovered. However, there is a substantial gap in our knowledge about the genes that cause many rare Mendelian phenotypes.” “Accordingly, we can realistically look towards a future in which the genetic basis of all Mendelian traits is known, …”

15 Exome sequencing

16 Cancer genome sequencing
Can obtain a full catalogue of mutations

17 Michael Stromberg,

18 Mutations in paediatric glioblastoma
Jabado, Pfister and Majewski

19 Mutations in paediatric glioblastoma
Sequenced the exomes of 48 paediatric GBM samples, found:
Somatic mutations in the H3.3-ATRX-DAXX chromatin remodelling pathway in 44% of tumours
Recurrent mutations in H3F3A, which encodes the replication-independent histone 3 variant H3.3, in 31% of tumours

20 Applications (III) Quantitative biology of complex systems
New high-throughput technologies in functional genomics: ChIP-Seq, RNA-Seq, ChIA-PET, RIP-Seq, …
From single-gene measurements, to thousands of probes on arrays, to profiles covering all 3B bases of the genome
Important systems: Stem cells, Cancer, Infectious diseases…

21 Outline
Overview of Next-Generation Sequencing (NGS)
Applications
Challenges
Solutions

22 High-throughput Sequencing
2009: 36 bp X 20M X 8 lanes ≈ 6 Gbases
2013: 2 X 150 bp X 250M X 8 lanes = 600 Gbases
200 Human Genomes in 1 run!!!

23 Big Data
2013: 2 X 10 TBytes, 1 TBytes
Image files, Intensity files, Reads + qualities

24 Big Data
2013: 2 X 10 TBytes, 1 TBytes
Intensity files, Reads + qualities
From: Alexandre Montpetit
Subject: news from Illumina
Date: 4 June, :15:16 PM EDT
To: Guillaume Bourque
From Mark Van Oene (Illumina VP of sales): over the next year we should expect 2x more reads in 2x less time (and 2x longer). Does that cause a problem?
Alex
12 TBytes, 240 TBytes
25 TB of raw data / month
300 TB of raw data / year
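The yearly figure is the monthly figure times twelve, and the email's forecast (2x more reads, 2x longer, in half the time) compounds to an 8x increase in bases per unit time. A rough sketch of what that could mean for storage, under the assumption (not stated on the slide) that raw data volume grows linearly with bases sequenced:

```python
RAW_TB_PER_MONTH = 25  # from the slide

raw_tb_per_year = RAW_TB_PER_MONTH * 12


def throughput_factor(reads_factor, length_factor, time_factor):
    """Relative bases per unit time: (reads x read length) / run time."""
    return reads_factor * length_factor / time_factor


# 2x more reads, 2x longer reads, runs taking half as long.
factor = throughput_factor(2, 2, 0.5)

# Assumption: raw data volume scales linearly with bases sequenced.
projected_tb_per_month = RAW_TB_PER_MONTH * factor

print(f"Current: {raw_tb_per_year} TB of raw data / year")
print(f"Throughput factor: {factor:.0f}x")
print(f"Projected: {projected_tb_per_month:.0f} TB of raw data / month")
```

Under these assumptions the projected monthly volume approaches the current yearly volume, which is one way to read the "does that cause a problem?" in the email.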

25 Large NGS project
Cancer project with whole genome data: 500 tumors vs 500 matched-normal
500 tumors: 500 X 3 lanes = 500 X 250GB = 125 TB raw
500 matched-normal: 500 X 3 lanes = 500 X 250GB = 125 TB raw
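The storage estimate for this project can be reproduced in a few lines. This is a sketch using the slide's own numbers (3 lanes and 250 GB of raw data per sample) and decimal units (1 TB = 1000 GB); the function name is illustrative:

```python
GB_PER_TB = 1000  # decimal units, matching 500 x 250 GB = 125 TB on the slide


def raw_storage_tb(samples, gb_per_sample):
    """Raw storage in TB for a set of samples."""
    return samples * gb_per_sample / GB_PER_TB


tumor_tb = raw_storage_tb(500, 250)
normal_tb = raw_storage_tb(500, 250)
total_tb = tumor_tb + normal_tb

print(f"Tumors: {tumor_tb:.0f} TB, matched-normal: {normal_tb:.0f} TB, "
      f"total: {total_tb:.0f} TB raw")
```

The total covers raw data only; aligned files and intermediate analysis outputs would add to it.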

26 DNA bases sequenced at the Innovation Center
12 HiSeqs
72 trillion DNA bases! Or 800 genomes at 30X
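The conversion from total bases to "genomes at 30X" divides by genome size times depth. A minimal check of the slide's figure, assuming a 3 billion bp genome as on the Human Genome slide:

```python
HUMAN_GENOME_BP = 3_000_000_000  # ~3 billion bp


def genomes_at_coverage(total_bases, coverage, genome_bp=HUMAN_GENOME_BP):
    """How many genomes a pool of sequenced bases covers at a given depth."""
    return total_bases / (genome_bp * coverage)


total_bases = 72_000_000_000_000  # 72 trillion, from the slide
print(f"{genomes_at_coverage(total_bases, 30):.0f} genomes at 30X")
```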


28 Biomedical research is built on data integration
Your data

29 Biomedical research is built on data integration
100X Your data

30 Challenges
NGS instruments generate TBs of data
NGS instruments are getting faster and cheaper, and will increasingly be found in small research labs and hospitals
Data sharing and integration are critical in biomedical research
Sequencing data is sensitive private data and is identifiable

31 Outline
Overview of Next-Generation Sequencing (NGS)
Applications
Challenges
Solutions

32 Nanuq software
Has tracked data and meta-data for more than:
2.6 million sample aliquots
20,500 reagents
17,000 plates
140,000 tubes
Multiple platforms, technologies and workflows (sequencing, genotyping, microarray, etc.)
3,900 external users

33 Standardized analysis pipelines
Methylation Analysis report
ChIP-Seq Analysis report
RNA-Seq Analysis report

34 Data center at the Innovation Center
> 1200 cores
> 2 PB disk
> 5 PB tape

35 Need more!
UdeS Mammouth – cores
McGill Guillimin – cores

36 Data processing issues
We have many different projects all needing space and processing.
We want to use the Compute Canada clusters for scalability but also to facilitate data distribution (we have >800 users).
This brings uniformity problems:
Different setups
Hardware and Software
Different configurations
Etc.

37 Our strategy
We wrote analysis pipelines to be easily configurable across clusters.
Same code, one ini file to customize (we already have templates for 3 cluster sites).
We install Linux modules readable by all on all these clusters so we know exactly what is available everywhere.
We also deploy common genomes across sites.
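The "same code, one ini file" idea can be sketched with Python's standard `configparser`. Everything cluster-specific below (section names, keys, paths, module names, and the `hypothetical_aligner` command) is invented for illustration and is not taken from the actual pipelines:

```python
import configparser

# Hypothetical ini contents for two clusters; section and key names are illustrative.
CLUSTER_A_INI = """\
[DEFAULT]
genome_dir = /cluster_a/shared/genomes

[align]
module = aligner/1.0
threads = 12
"""

CLUSTER_B_INI = """\
[DEFAULT]
genome_dir = /cluster_b/project/genomes

[align]
module = aligner/1.0
threads = 8
"""


def load_config(ini_text):
    """Parse one cluster's ini text into a ConfigParser."""
    config = configparser.ConfigParser()
    config.read_string(ini_text)
    return config


def align_command(config, sample):
    """Build the shell command for the (hypothetical) alignment step."""
    step = config["align"]  # keys from [DEFAULT] are visible here too
    return (
        f"module load {step['module']} && "
        f"hypothetical_aligner --threads {step['threads']} "
        f"--reference {step['genome_dir']}/reference.fa {sample}.fastq"
    )


for name, ini_text in [("cluster_a", CLUSTER_A_INI), ("cluster_b", CLUSTER_B_INI)]:
    print(name, "->", align_command(load_config(ini_text), "sample1"))
```

The pipeline function never mentions a cluster by name: moving to a new site means writing a new ini file, while installing the same modules and reference genomes everywhere keeps the `module load` line and genome paths predictable.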

38 Usage on Compute Canada

39 Canadian Epigenetics, Environment and Health Research Consortium (CEEHRC)

40 PORTal for the Analysis of Genetics and Genomics Experiments (PORTAGGE)

41 Conclusions
NGS offers a variety of technologies and numerous exciting applications
Many areas of NGS data analysis are still under active development (e.g. RNA-Seq)
A major challenge is to ensure sufficient compute and storage capacity so that more advanced analyses are not limited
We need to work together to avoid duplicating effort in installing tools, but also to develop efficient ways to use HPC in biomedical research

42 Acknowledgements
IT team: Terrance Mcquilkin Marc-André Labonté Genevieve Dancausse Andras Frankel Alexandru Guja
Analysis team: Louis Letourneau Mathieu Bourgey Maxime Caron Gary Lévesque Robert Eveleigh Francois Lefebvre Johanna Sandoval Pascale Marquis
Development team: Nathalie Émond David Bujold Francois Cantin Catherine Côté Burak Demirtas Daniel Guertin Louis Dumond Joseph Francois Korbuly Marc Michaud Thuong Ngo
EDCC team: David Morais (UdeS) Carol Gauthier (UdeS) Bryan Caron (McGill) Alain Veilleux (UdeS) ME Rousseau (McGill)

43 Questions?
