DETECTING CNV BY EXOME SEQUENCING Fah Sathirapongsasuti Biostatistics, HSPH.

Slides:



Advertisements
Similar presentations
Vanderbilt Center for Quantitative Sciences Summer Institute Sequencing Analysis Yan Guo.
Advertisements

Hidden Markov Models (1)  Brief review of discrete time finite Markov Chain  Hidden Markov Model  Examples of HMM in Bioinformatics  Estimations Basic.
Reference mapping and variant detection Peter Tsai Bioinformatics Institute, University of Auckland.
We processed six samples in triplicate using 11 different array platforms at one or two laboratories. we obtained measures of array signal variability.
High Throughput Sequencing
Bioinformatics at Molecular Epidemiology - new tools for identifying indels in sequencing data Kai Ye
Bioinformatics lectures at Rice University Li Zhang Lecture 10: Networks and integrative genomic analysis-2 Genome instability and DNA copy number data.
Yanxin Shi 1, Fan Guo 1, Wei Wu 2, Eric P. Xing 1 GIMscan: A New Statistical Method for Analyzing Whole-Genome Array CGH Data RECOMB 2007 Presentation.
1000 Genomes SV detection Boston College Chip Stewart 24 November 2008.
I inherited What??? You and Your Genes: The Explosive New World of Genetics David Finegold, M.D.
1 Genome Rearrangements João Meidanis São Paulo, Brazil December, 2004.
High Throughput Sequencing
HL7 Clinical Sequencing Symposium Oncology Use Cases Ellen Beasley, Ph.D.September 14, 2011 VP, Ion Bioinformatics.
Whole Exome Sequencing for Variant Discovery and Prioritisation
Detecting copy number variations using paired-end sequence data Nick Furlotte CS224 May 29, 2009.
RExPrimer Pongsakorn Wangkumhang, M.Sc. Biostatistics and Informatics Laboratory, Genome Institute, National Center for Genetic Engineering and Biotechnology.
1 Genetic Variability. 2 A population is monomorphic at a locus if there exists only one allele at the locus. A population is polymorphic at a locus if.
Capture / Resequencing Data Handling and Analysis
Constitutional (germ-line) variants in hereditary conditions
Detection of structural variants and copy number alterations in cancer: from computational strategies to the discovery of chromothripsis in neuroblastoma.
Next-Generation Sequencing
CS CM124/224 & HG CM124/224 DISCUSSION SECTION (JUN 6, 2013) TA: Farhad Hormozdiari.
Supplemental data 2. Breast cancer primary tumor, metastasis and xenograft Total copy number gain (green), loss (red) and unchanged (black) for primary.
1 Commentary 1.Do not get too worried about "methods" and details. I fully expect there to be concepts and techniques that you simply are not going to.
Considerations for Analyzing Targeted NGS Data Exome Tim Hague, CTO.
Alexis DereeperCIBA courses – Brasil 2011 Detection and analysis of SNP polymorphisms.
Copy Number Variation Eleanor Feingold University of Pittsburgh March 2012.
E XOME SEQUENCING AND COMPLEX DISEASE : practical aspects of rare variant association studies Alice Bouchoms Amaury Vanvinckenroye Maxime Legrand 1.
Identification of Copy Number Variants using Genome Graphs
____ __ __ _______Birol et al :: AGBT :: 7 February 2008 A NOVEL APPROACH TO IMPROVE THE NOISE IN DETECTING COPY NUMBER VARIATIONS USING OLIGONUCLEOTIDE.
Cancer genomics Yao Fu March 4, Cancer is a genetic disease In the early 1970’s, Janet Rowley’s microscopy studies of leukemia cell chromosomes.
Computational Laboratory: aCGH Data Analysis Feb. 4, 2011 Per Chia-Chin Wu.
Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.
ASSEMBLY AND ALIGNMENT-FREE METHOD OF PHYLOGENY RECONSTRUCTION FROM NGS DATA Huan Fan, Anthony R. Ives, Yann Surget-Groba and Charles H. Cannon.
Supplemental Figure 1. Bias-corrected NGS bioinformatics strategies. Paired-end DNA sequencing reveals the sequence of the genomic clone, the sample ID.
CGH Data BIOS Chromosome Re-arrangements.
Calling Somatic Mutations using VarScan
Analyzing DNA using Microarray and Next Generation Sequencing (1) Background SNP Array Basic design Applications: CNV, LOH, GWAS Deep sequencing Alignment.
Canadian Bioinformatics Workshops
Global Variation in Copy Number in the Human Genome Speaker: Yao-Ting Huang Nature, Genome Research, Genome Research, 2006.
Canadian Bioinformatics Workshops
DEPARTMENT OF HEALTH AND HUMAN SERVICES National Institutes of Health National Cancer Institute Frederick National Laboratory is a federally funded research.
Spectral Algorithms for Learning HMMs and Tree HMMs for Epigenetics Data Kevin C. Chen Rutgers University joint work with Jimin Song (Rutgers/Palentir),
Canadian Bioinformatics Workshops
From Reads to Results Exome-seq analysis at CCBR
Canadian Bioinformatics Workshops
Big Data in Genomics, Diagnostics, and Precision Medicine
Interpreting exomes and genomes: a beginner’s guide
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
Variant Calling Chris Fields
High-throughput sequencing analyses
RNA-Seq analysis in R (Bioconductor)
From: Dependable and Efficient Clinical Utility of Target Capture-Based Deep Sequencing in Molecular Diagnosis of Retinitis Pigmentosa Invest. Ophthalmol.
Ane Y. Schmidt, Thomas v. O. Hansen, Lise B
Whole-exome sequencing for RH genotyping and alloimmunization risk in children with sickle cell anemia by Stella T. Chou, Jonathan M. Flanagan, Sunitha.
A Targeted High-Throughput Next-Generation Sequencing Panel for Clinical Screening of Mutations, Gene Amplifications, and Fusions in Solid Tumors  Rajyalakshmi.
Assessing Copy Number Alterations in Targeted, Amplicon-Based Next-Generation Sequencing Data  Catherine Grasso, Timothy Butler, Katherine Rhodes, Michael.
Volume 150, Issue 4, Pages (April 2016)
Linking Genetic Variation to Important Phenotypes
Genomic alterations in breast cancer cell line MDA-MB-231.
Copy-number alterations in an archival breast cancer sample.
Eric Samorodnitsky, Jharna Datta, Benjamin M
Volume 150, Issue 4, Pages (April 2016)
Diverse abnormalities manifest in RNA
Canadian Bioinformatics Workshops
Variant Calling Chris Fields
Development of a Novel Next-Generation Sequencing Assay for Carrier Screening in Old Order Amish and Mennonite Populations of Pennsylvania  Erin L. Crowgey,
Concordance between the genomic landscape identified by whole-exome sequencing of plasma cfDNA and tumor; DNA and recurrence of KDR/VEGFR2 oncogenic mutations.
The Variant Call Format
Presentation transcript:

DETECTING CNV BY EXOME SEQUENCING Fah Sathirapongsasuti Biostatistics, HSPH

~85% of the disease-causing mutations occur in protein coding regions (exome) Exome constitutes 1% of the genome About 160, ,000 exons Time-saving and cost-effective Capturing protein coding portion of the genome Exome Sequencing

3 FASTA/FASTQ SAM/BAM VCF/BCF SAM/BAM Coverage

4

5

6

7

8

9 Pileup Standard format for mapped data, position summaries seq1 272 T 24,.$.....,,.,.,...,,,.,..^+.<<<+;<<<<<<<<<<<=<;<;7<& seq1 273 T 23,.....,,.,.,...,,,.,..A<<<;<<<<<<<<<3<=<<<;<<+ seq1 274 T 23,.$....,,.,.,...,,,.,...7<7;<;<<<<<<<<<=<;<;<<6 seq1 275 A 23,$....,,.,.,...,,,.,...^l.<+;9*<<<<<<<<<=<<:;<<<< seq1 276 G 22...T,,.,.,...,,,.,....33;+<<7=7<<7<&<<1;<<6< seq1 277 T ,,.,.,.C.,,,.,..G.+7<;<<<<<<<&<=<<:;<<&< seq1 278 G ,,.,.,...,,,.,....^k.%38*<<;<7<<7<=<<<;<<<<< seq1 279 C 23 A..T,,.,.,...,,,.,.....;75&<<<<<<<<<=<<<9<<:<< Seq. Pos. Ref. Len.AlignmentQuality

10

11 Variant Call Format ##format=PCFv1 ##fileDate= ##source=myImputationProgramV3.1 ##reference=1000GenomesPilot-NCBI36 ##phasing=partial #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA rs G A 29 0 NS=58;DP=258;AF=0.786;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51, T A 3 q10 NS=55;DP=202;AF=0.024 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65, rs A G,T 67 0 NS=55;DP=276;AF=0.421,0.579;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18, T NS=57;DP=257;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51, microsat1 G D4,IGA 50 0 NS=55;DP=250;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 ##format=PCFv1 ##fileDate= ##source=myImputationProgramV3.1 ##reference=1000GenomesPilot-NCBI36 ##phasing=partial #CHROM POS ID REF ALT QUAL FILTER INFO rs G A 29 0 NS=58;DP=258;AF=0.786;DB;H2 FORMAT NA00001 NA00002 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51

12

gains and losses of chunks of DNA sequences Sizes: 1kb-5Mb (Sanger’s CNV Project) Generally large chunks … Small gains/losses are called insertion/deletion (in-del) CNV Copy-Number Variation/Alteration Comparative Genomic Hybridisation Blue lines: individuals with two copies. Red line: individual with zero copy.

All techniques were developed for whole genome sequencing or targeted sequencing of one continuous region. Two approaches: Paired-End Methods (use insert size) Depth of Coverage Challenges of Exome Sequencing: Discontinuous search space Paired-end methods won’t work The only natural way to discretize the data is by exon Resolution is limited by distance between exons Non-uniform distribution of reads Exon capture probes have different efficiency CNV method specific for Exome Seq is needed

Min1 st QuMedMean3 rd QuMax 1231,9994,98129,21014,03020,900,000 CNV Resolution is limited by exome probe design

Treat one exon as a unit (variable length) Measure depth of coverage (average coverage) per exon Key assumptions: Number of reads over exons of certain size follows Poisson distribution Average coverage is directly proportional to the number of reads; i.e. average coverage = #reads * read length / exon length Depth of Coverage Approach

arbitrary cutoff Using the ratio of depth-of-coverage to detect CNV Specificity Sensitivity = Power Null: no CNV shiftAlt: CNV shift

Power to detect CNV depends on depth-of-coverage Deletion Duplication

It is generally harder to detect higher copy number as the variance increases linearly with the mean Deletion Duplication

Tumor sample is usually contaminated with normal cells Ratio will tend to 1, making it more difficult to detect CNV Have to estimate admixture rate prior to calling CNV otherwise power may be over/underestimated. Issue: Admixture 50% admixture

ExomeCNV Overview library(ExomeCNV) chr.list = c("chr19","chr20","chr21") suffix = ".coverage" prefix = " normal = read.all.coverage(prefix, suffix, chr.list, header=T) prefix = " tumor = read.all.coverage(prefix, suffix, chr.list, header=T) source(" biocLite("DNAcopy") install.packages("ExomeCNV")

Exome CNV Calling Method Idea: -Use ROC to determine optimum ratio cutoff for a given exon of certain length and coverage -Only make a call when enough power can be achieved Calculate log adjusted ratio Optimize cutoff based on read coverage, exon length, and estimated admixture rate Call CNV on each exon demo.logR = calculate.logR(normal, tumor) demo.eCNV = c() for (i in 1:length(chr.list)) { idx = (normal$chr == chr.list[i]) ecnv = classify.eCNV(normal=normal[idx,], tumor=tumor[idx,], logR=demo.logR[idx], min.spec=0.9999, min.sens=0.9999, option="spec", c=0.5, l=70) demo.eCNV = rbind(demo.eCNV, ecnv) } do.plot.eCNV(demo.eCNV, lim.quantile=0.99, style="idx", line.plot=F)

Merging exonic CNVs into segments Circular binary segmentation

Sequential merging: Use circular binary segmentation (CBS) algorithm to identify breakpoints, then use the above method to call CNV for segments Merge exon with no CNV call, normal call, or same CNV call Breakpoint Identification and Sequential Merging Call CNV on each exon Run CBS to merge exons into segments Call CNV on each segment Merge with previous CNV segments demo.cnv = multi.CNV.analyze(normal, tumor, logR=demo.logR, all.cnv.ls=list(demo.eCNV), coverage.cutoff=5, min.spec=0.99, min.sens=0.99, option="auc", c=0.5) do.plot.eCNV(demo.cnv, lim.quantile=0.99, style="bp", bg.cnv=demo.eCNV, line.plot=T)

Visualization by circo

JF Sathirapongsasuti, et al. (2011) Exome Sequencing-Based Copy- Number Variation and Loss of Heterozygosity Detection: ExomeCNV, Bioinformatics, 2011 Oct 1;27(19): Epub 2011 Aug 9. JF Sathirapongsasuti, et al. (2011) Exome Sequencing-Based Copy- Number Variation and Loss of Heterozygosity Detection: ExomeCNV, Bioinformatics, 2011 Oct 1;27(19): Epub 2011 Aug 9. Resources

Thank you …