Big data challenges in personalized cancer medicine
Bioinformatics activities in the Norwegian Cancer Genomics Consortium (NCGC)
Sigve Nakken
Postdoctoral fellow, Eivind Hovig's group
Norwegian Cancer Genomics Consortium (NCGC)
Department of Tumor Biology, ICR, OUS
Norwegian Cancer Genomics Consortium (NCGC)
Founded by oncologists and cancer scientists across the country (Tromsø, Trondheim, Bergen, Oslo)
Contributing to and following the national prioritization of “Individualized cancer treatment based on the gene profile of the tumour” as the most important topic in cancer research
Has obtained grants of NOK 75 million (≈ USD 10 million) from the Research Council
Industrial partners: OCC, PubGene, BergenBio
Project divided into work packages
  WP4: Data handling and establishment of national infrastructure
NCGC sample cohorts
Cancer type, REK approvals, Sequencing, Samples, Analysis
  Melanoma  Approved  Done  115  On-going
  Colon cancer  100
  Multiple myeloma
  Lymphoma  76
  Leukemia  41
  Sarcoma  -
  Prostate  75
  Breast cancer
  Ovarian cancer  Submitted
NCGC cancer genome sequencing
Exome sequencing
Goal: identify & characterize the acquired genetic changes in the tumor sample by massively parallel deep sequencing
  SNVs & insertions/deletions
  Copy number aberrations
  Structural rearrangements
Cancer genome sequencing (II) Variant calling pipeline
Cancer genome sequencing (III)
How deep should I sequence my tumor sample? (to detect a mutant subpopulation at X percent?)
Biological complexity
  Tumor purity
  Ploidy
  Local CNAs
Technical biases
  Uneven coverage (GC)
  PCR artefacts
  Sequencing quality/errors
  Oxidation (DNA extraction + library prep)
Other
  Tumor-control mismatch
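The depth question can be approached with a simple binomial read-sampling model. The sketch below is illustrative only: it assumes error-free reads, uniform coverage, and a fixed minimum number of supporting reads for a call, so it ignores the technical biases listed above. The function names, the default of 5 supporting reads, and the 95% detection target are assumptions, not NCGC settings.

```python
from math import comb


def expected_vaf(purity, clonal_fraction, tumor_copy_number=2, mutated_copies=1):
    """Expected variant allele fraction of a somatic mutation.

    purity: fraction of tumor cells in the sample (0..1)
    clonal_fraction: fraction of tumor cells carrying the mutation (0..1)
    tumor_copy_number: local copy number in tumor cells (normal cells assumed diploid)
    mutated_copies: number of mutated copies per mutated tumor cell
    """
    tumor_alleles = purity * tumor_copy_number
    normal_alleles = (1.0 - purity) * 2
    return purity * clonal_fraction * mutated_copies / (tumor_alleles + normal_alleles)


def detection_probability(depth, vaf, min_alt_reads):
    """P(at least min_alt_reads variant reads) when reads ~ Binomial(depth, vaf)."""
    p_fewer = sum(
        comb(depth, k) * vaf**k * (1.0 - vaf) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_fewer


def required_depth(vaf, min_alt_reads=5, power=0.95, max_depth=5000):
    """Smallest depth reaching the requested detection probability, or None."""
    for depth in range(min_alt_reads, max_depth + 1):
        if detection_probability(depth, vaf, min_alt_reads) >= power:
            return depth
    return None


if __name__ == "__main__":
    # Hypothetical case: 50% purity, subclone present in 20% of tumor cells.
    vaf = expected_vaf(purity=0.5, clonal_fraction=0.2)
    print("expected VAF:", vaf)
    print("depth needed:", required_depth(vaf))
```

Lower purity, higher local copy number, or a smaller subclone all reduce the expected allele fraction and push the required depth up, which is why purity, ploidy and local CNAs enter the calculation.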
Somatic variant calling
Two key components
  Read alignment – mapping each read to its proper position in the genome
  Mutation calling – quantify the likelihood of a true somatic mutation
Best-practice workflows defined
Still many different algorithms to choose from
Need for benchmark
ICGC mutation benchmark
Purpose:
  Assess concordance & accuracy of somatic SNV/indel calling among variant calling pipelines used in different research groups
  Evaluate impact of different algorithms (aligner, caller etc.)
  NCGC: optimize and verify running pipeline (“ICGC stamp”)
Participants were given raw sequence reads from a medulloblastoma (MB99) genome (tumor + normal), ~40X coverage
Task: submit somatic indels + SNVs
Coordinated by CNAG, Barcelona (Ivo Gut’s lab)
Weekly global telephone conferences
BM1.2
SNVs – how well do we agree?
InDels – how well do we agree?
Verification of calls – GOLD set
300X sequencing of the same genome
Six different pipelines called somatic SNVs and InDels
SNVs with concordance of > 3 accepted
SNVs with concordance < 3 and all indels reviewed manually
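The concordance rule for the GOLD set can be sketched as a vote count across pipelines. This is a minimal illustration, not the benchmark's actual code: the tuple layout `(kind, chrom, pos, ref, alt)` and the function name are hypothetical, and because the slide does not say what happens to SNVs called by exactly three pipelines, the sketch conservatively sends them to manual review.

```python
from collections import Counter


def triage_calls(pipeline_calls, accept_above=3):
    """Split pooled calls into auto-accepted SNVs and calls needing manual review.

    pipeline_calls: dict mapping pipeline name -> iterable of variants, where
    each variant is a tuple (kind, chrom, pos, ref, alt) and kind is "snv" or
    "indel" (hypothetical representation).
    """
    support = Counter()
    for calls in pipeline_calls.values():
        # set() so that a pipeline reporting a variant twice counts only once
        support.update(set(calls))

    accepted, review = set(), set()
    for variant, n_pipelines in support.items():
        kind = variant[0]
        if kind == "snv" and n_pipelines > accept_above:
            accepted.add(variant)
        else:
            # low-concordance SNVs (assumption: including exactly 3) and all indels
            review.add(variant)
    return accepted, review
```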
Accuracy – SNV/InDels
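Accuracy against a verified call set is commonly summarized as precision, recall and their harmonic mean (F1). The sketch below assumes these metrics; the slide itself does not state which measures were plotted, and the function name and variant representation are illustrative.

```python
def evaluate_calls(called, gold):
    """Compare one pipeline's calls against a verified (gold) call set.

    Both arguments are iterables of hashable variant identifiers,
    e.g. tuples (chrom, pos, ref, alt).
    """
    called, gold = set(called), set(gold)
    tp = len(called & gold)   # calls confirmed by the gold set
    fp = len(called - gold)   # calls absent from the gold set
    fn = len(gold - called)   # gold-set variants the pipeline missed

    precision = tp / (tp + fp) if called else 0.0
    recall = tp / (tp + fn) if gold else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}
```

A pipeline tuned for sensitivity trades precision for recall, so reporting both makes aligner-caller combinations easier to compare than a single accuracy number.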
Impact of aligner-caller combination
Benchmark manuscript
Improved accuracy – SNVs/InDels (figure panels labelled EH_rev)
Interpretation of variants
Which variants/genes are of functional relevance?
  Is my variant a frequent mutation? In which cancer types?
  Is my variant likely to alter the activity of the encoded protein?
  Is my variant known as a drug sensitivity marker?
  Which mutant genes are known drug targets?
Annotation pipeline: Variant calling, Functional annotation, Prioritization
Variants – phenotypic effect?
Computational prediction of damaging variants
  Machine learning
  Numerous algorithms: SIFT, PolyPhen2, MutationTaster, MutationAssessor, Provean, FATHMM, etc.
Challenge: many have been trained with Mendelian disease mutations
Gain-of-function mutations are hard to predict
Variants – clinical associations? Recent promising resources/data on clinically associated variants
Which genes are key drivers?
Which genes show significantly more mutations than random expectation?
  Requires sophisticated modeling of the background mutation rates
  MutSigCV
Which genes are enriched with functionally biased variants?
  IntOGen
Lawrence et al., Nature (2013)
Gonzalez-Perez et al., Nature Methods (2013)
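The idea of "more mutations than random expectation" can be illustrated with a deliberately naive test: assume one uniform background mutation rate and ask how surprising the observed count in a gene is under a Poisson model. This is not how MutSigCV works (it models gene- and patient-specific background rates using covariates); the sketch only shows the shape of the question, and all names and numbers are assumptions.

```python
from math import exp


def poisson_tail(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected)."""
    term = exp(-expected)          # P(X = 0)
    cdf = 0.0
    for k in range(observed):
        cdf += term                # accumulate P(X = k)
        term *= expected / (k + 1) # step to P(X = k + 1)
    return max(0.0, 1.0 - cdf)


def gene_burden_pvalue(n_mutations, coding_length_bp, n_samples, background_rate):
    """P-value for seeing at least n_mutations in a gene across a cohort.

    background_rate: assumed uniform mutations per base pair per sample.
    """
    expected = coding_length_bp * n_samples * background_rate
    return poisson_tail(n_mutations, expected)


if __name__ == "__main__":
    # Hypothetical gene: 1500 coding bp, 100 samples, 2 mutations per Mb.
    print(gene_burden_pvalue(8, 1500, 100, 2e-6))
```

With a single uniform rate, long genes and genes in highly mutable regions tend to be flagged spuriously, which is the reason the background rates need the more careful modeling mentioned above.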
NCGC – data trends
Mutational heterogeneity – across cancer types
Mutational heterogeneity – within cancer types CRC Melanoma
Functional heterogeneity
Mutational signatures
Distinct mutational patterns (mutation types & sequence context) that reflect underlying mutational processes
Mathematical framework to infer the k mutational signatures contributing to a cohort
What is the relative contribution of each process in each sample?
  S1 – Alkylating agents (?)
  S2 – UV damage
  S3 – Aging
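The slide does not name the mathematical framework. A widely used choice for this kind of decomposition is non-negative matrix factorization (NMF), and the sketch below assumes it: a matrix V of mutation counts (mutation types × samples) is approximated as W·H, where the k columns of W are signatures and H holds each signature's exposure per sample. The pure-Python implementation with multiplicative updates is for illustration on small inputs only; function names and defaults are hypothetical.

```python
import random


def matmul(a, b):
    """Matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Squared reconstruction error ||V - WH||^2."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw))


def nmf(v, k, n_iter=200, seed=0, eps=1e-9):
    """Factorize non-negative V (m x n) into W (m x k) and H (k x n)."""
    rng = random.Random(seed)
    m, n = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(m)]
    h = [[rng.random() + eps for _ in range(n)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)]
             for i in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(m)]
    return w, h


def normalize(w, h):
    """Rescale so each signature (column of W) sums to 1; H absorbs the scale."""
    k = len(h)
    sums = [sum(row[i] for row in w) or 1.0 for i in range(k)]
    w_n = [[row[i] / sums[i] for i in range(k)] for row in w]
    h_n = [[x * sums[i] for x in h[i]] for i in range(k)]
    return w_n, h_n


def relative_contributions(h):
    """Per sample, the fraction of mutations attributed to each signature."""
    out = []
    for sample in zip(*h):
        total = sum(sample) or 1.0
        out.append([x / total for x in sample])
    return out
```

After `w, h = normalize(*nmf(v, k))`, `relative_contributions(h)` answers the per-sample question on the slide. Choosing k and checking the stability of the solution across random seeds are separate steps not covered here.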
In progress/future plans
Evaluation of more read aligners/variant callers
Integration of improved calling of copy number aberrations
Inference of clonal population structure
Report per tumor case – QC, mutated cancer genes, actionable targets etc.
Improved tools for visualization of results
Other activities
Acknowledgements
NCGC
  Principal investigators
  Department of Tumor Biology: Leonardo Meza-Zepeda, Susanne Lorenz, Ola Myklebost, Daniel Vodak, Ghislain Fournous, Lars Birger Aasheim, Eivind Hovig
ICGC Technical Validation group