DNA Copy Number Analysis Qunyuan Zhang,Ph.D. Division of Statistical Genomics Department of Genetics & Center for Genome Sciences Washington University.

DNA Copy Number Analysis
Qunyuan Zhang, Ph.D.
Division of Statistical Genomics, Department of Genetics & Center for Genome Sciences, Washington University School of Medicine
03-25-2008
GEMS Course: M 21-621 Computational Statistical Genetics

Four Questions
- What is Copy Number?
- What can Copy Number tell us?
- How to measure/quantify Copy Number?
- How to analyze Copy Number?

What is Copy Number?
Gene Copy Number
The gene copy number (also "copy number variants" or CNVs) is the amount of copies of a particular gene in the genotype of an individual. Recent evidence shows that the gene copy number can be elevated in cancer cells. For instance, the EGFR copy number can be higher than normal in non-small cell lung cancer. … Elevating the gene copy number of a particular gene can increase the expression of the protein that it encodes.
From Wikipedia (www.wikipedia.org)

DNA Copy Number
A Copy Number Variant (CNV) represents a copy number change involving a DNA fragment that is ~1 kilobase or larger. (From Nature Reviews Genetics, Feuk et al. 2006)
- DNA Copy Number ≠ DNA Tandem Repeat Number (e.g. microsatellites; <10 bases)
- DNA Copy Number ≠ RNA Copy Number
- RNA Copy Number = Gene Expression Level (DNA -> transcription -> mRNA)
Copy Number is the number of copies of a particular fragment of a nucleic acid chain. In most publications it refers to DNA Copy Number.

What can Copy Number tell us?
Genetic Diversity/Polymorphisms
- restriction fragment length polymorphism (RFLP)
- amplified fragment length polymorphism (AFLP)
- random amplification of polymorphic DNA (RAPD)
- variable number of tandem repeat (VNTR; e.g., mini- and microsatellite)
- single nucleotide polymorphism (SNP)
- presence/absence of transposable elements
- …
- structural alterations (e.g., deletions, duplications, inversions, …)
- DNA copy number variant (CNV)
Association with phenotypes/diseases
Genes/genetic factors

Genetic Alterations in Tumor Cells (DNA Copy Number Changes)
- Homologous repeats
- Segmental duplications
- Chromosomal rearrangements
- Duplicative transpositions
- Non-allelic recombinations
- ……
[Figure: normal cell (CN=2) compared with tumor cells showing deletion (CN=0, CN=1), amplification (CN=3, CN=4), and unaltered regions (CN=2)]

How to measure/quantify Copy Number?
Quantitative Polymerase Chain Reaction (Q-PCR): DNA amplification (dNTPs, primers, Taq polymerase, fluorescent dye); one fragment each time
- less CN -> PCR amplification -> less DNA -> low fluorescent intensity
- more CN -> PCR amplification -> more DNA -> high fluorescent intensity
Microarray: DNA hybridization; multiple/different fragments in a mixed pool
- PCR amplification (dNTPs, primers, Taq polymerase, fluorescent dye), followed by hybridization to arrayed probes
- less CN -> less DNA -> low intensities at the arrayed probes
- more CN -> more DNA -> high intensities at the arrayed probes

Array CGH: From Image to Copy Number
- Tumor: red intensity; Normal: green intensity
- Red < Green: Deletion (CN<2)
- Red > Green: Amplification (CN>2)
- Red = Green: No Alteration (CN=2)
more DNA copy number -> more DNA hybridization -> higher intensity

SNP Array: From Image to Copy Number
- Affymetrix Mapping 250K Sty I chip: ~250K probe sets, ~250K SNPs
- Each probe set contains 24 probes
[Figure: tumor and normal chip images, with probe sets illustrating deletion (CN=0, CN=1), no alteration (CN=2), and amplification (CN>2)]
more DNA copy number -> more DNA hybridization -> higher intensity

How to Analyze Copy Number?

General Procedures for Copy Number Analysis
- Finished chips
- (scanner) Raw image data [.DAT files] (experiment info [.EXP])
- (image processing software) Probe level raw intensity data [.CEL files]
- Preprocessing (chip description file [.CDF]): Background adjustment, Normalization, Summarization
- Summarized intensity data
- Raw copy number (CN) data [log ratio of tumor/normal intensities]
- Significance test of CN changes
- Estimation of CN
- Smoothing and boundary determination
- Concurrent regions among population
- Amplification and deletion frequencies among populations
- Association analysis

Background Adjustment/Correction
- Reduces unevenness of a single chip
- Makes intensities of different positions on a chip comparable
[Figure: chip image before adjustment and after adjustment]
Corrected Intensity (S') = Observed Intensity (S) - Background Intensity (B)
For each region i, B(i) = mean of the lowest 2% of intensities in region i (Affymetrix MAS 5.0)
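The region-wise subtraction above can be sketched in a few lines of Python. This is a simplified illustration, not the MAS 5.0 implementation: the function name `background_correct`, the flat list layout, and the equal-sized consecutive regions are assumptions made for the example, and the smoothing between regions that real software applies is left out.

```python
def background_correct(intensities, n_regions=4, low_fraction=0.02):
    """Subtract a per-region background B(i) from each observed intensity S.

    Hypothetical helper: regions are consecutive, equal-sized chunks of a
    flat intensity list, and B(i) is the mean of the lowest 2% of values.
    """
    if not intensities:
        return []
    size = -(-len(intensities) // n_regions)  # ceiling division
    corrected = []
    for start in range(0, len(intensities), size):
        region = intensities[start:start + size]
        k = max(1, int(len(region) * low_fraction))  # at least one probe
        background = sum(sorted(region)[:k]) / k     # B(i)
        corrected.extend(s - background for s in region)  # S' = S - B(i)
    return corrected


# Two regions with the same signal but different background levels.
chip = list(range(1, 101)) + list(range(201, 301))
adjusted = background_correct(chip, n_regions=2)
```

After correction the two regions are on the same scale. Probes below the regional background come out slightly negative here; a production method would need a rule for those values.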

Background Adjustment/Correction
- Eliminates non-specific hybridization signal
- Obtains accurate intensity values for specific hybridization
- Methods: PM (perfect match) only, PM-MM (perfect match minus mismatch), Ideal MM, etc.
[Figure: quartet probe set; sense or antisense strands; 25 oligonucleotide probes]

Normalization
- Reduces technical variation between chips
- Makes intensities from different chips comparable
[Figure: intensities before normalization and after normalization]
Methods: Base Line Array (linear); Quantile Normalization; Contrast Normalization; etc.
S' = (S - Mean of S) / (STD of S), so that S' ~ N(0,1)
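As a rough illustration of two of the approaches named above, the sketch below standardizes one chip to mean 0 and SD 1, and applies a basic quantile normalization across chips. The function names and the list-of-lists layout are assumptions for the example; ties are broken by position, which dedicated packages handle more carefully.

```python
import statistics


def standardize(chip):
    """S' = (S - mean of S) / (SD of S) for one chip."""
    mean = statistics.fmean(chip)
    sd = statistics.stdev(chip)
    return [(s - mean) / sd for s in chip]


def quantile_normalize(chips):
    """Give every chip the same intensity distribution.

    chips: list of equal-length intensity lists, one per chip.
    The shared reference distribution is the mean of the sorted chips.
    """
    n = len(chips[0])
    sorted_chips = [sorted(c) for c in chips]
    reference = [statistics.fmean(col) for col in zip(*sorted_chips)]
    normalized = []
    for chip in chips:
        order = sorted(range(n), key=lambda i: chip[i])  # probe indices by rank
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


chips = [[2.0, 4.0, 6.0], [30.0, 10.0, 20.0]]
print(quantile_normalize(chips))
```

Each probe keeps its rank within its own chip, but the values it can take are now shared by all chips.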

Summarization
Combines the multiple probe intensities for each probe set to produce a summarized value for subsequent analyses.
- Average methods: PM only or PM-MM, allele specific or non-specific
- Model based method: Gene Expression Index (Li & Wong, 2001)
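A minimal sketch of the average methods, assuming the probes of one probe set are given as plain lists; `summarize_probe_set` is a hypothetical helper name, and the model based method is not shown.

```python
def summarize_probe_set(pm, mm=None):
    """Average the probe intensities of one probe set.

    pm: perfect match intensities; mm: optional mismatch intensities.
    With mm given, the PM-MM differences are averaged instead of PM alone.
    """
    values = pm if mm is None else [p - m for p, m in zip(pm, mm)]
    return sum(values) / len(values)


pm_only = summarize_probe_set([10, 20, 30])
pm_minus_mm = summarize_probe_set([10, 20, 30], [4, 8, 12])
```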

Raw Copy Number Data
S: summarized raw intensity
S': log transformation, S' = log2(S)
Raw CN: log ratio of tumor/normal intensities
CN = S'_tumor - S'_normal = log2(S_tumor / S_normal)
- Pair design: S_normal = S of the paired normal sample
- Group design: S_normal = average S of the group of normal samples
[Figure: distributions before log transformation (S), after log transformation (Log(S)), and of raw CN]
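The two designs differ only in where the normal reference comes from. The sketch below assumes summarized intensities are stored as one list per sample with SNPs in the same order, and that the group reference is averaged on the raw scale before taking logs; both function names are made up for the example.

```python
import math


def raw_cn_paired(s_tumor, s_normal):
    """Raw CN per SNP: log2(S_tumor / S_normal) for a tumor/normal pair."""
    return [math.log2(t / n) for t, n in zip(s_tumor, s_normal)]


def raw_cn_group(s_tumor, normal_group):
    """Raw CN per SNP against the average of a group of normal samples."""
    reference = [sum(col) / len(col) for col in zip(*normal_group)]
    return raw_cn_paired(s_tumor, reference)


print(raw_cn_paired([400.0, 100.0, 200.0], [200.0, 200.0, 200.0]))
# prints [1.0, -1.0, 0.0]: a doubling, a halving, and no change
```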

Individual Level Analysis
Analysis for each individual sample (or each sample pair)
- Smoothing
- Significance test of CN amplification and deletion
- Boundary finding (smoothing and segmentation)
- CN estimation

Smoothing via Sliding Window
[Figure: overlapping windows (Window 1, Window 2, …, Window k, …, Window N) sliding along the ordered SNPs]
Each window (k) contains 30 consecutive SNPs (k, k+1, k+2, k+3, …, k+29)
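A sliding window of this kind can be sketched as below. The use of the window mean as the smoothed value is an assumption (the slide does not name the summary statistic), and `sliding_window_mean` is a hypothetical helper name.

```python
def sliding_window_mean(raw_cn, window=30):
    """Return one smoothed value per window.

    Window k covers SNPs k, k+1, ..., k+window-1, so consecutive windows
    overlap by window-1 SNPs. Returns an empty list when there are fewer
    SNPs than the window size.
    """
    return [
        sum(raw_cn[k:k + window]) / window
        for k in range(len(raw_cn) - window + 1)
    ]


# 30 unaltered SNPs followed by 30 SNPs with raw CN = 1.
smoothed = sliding_window_mean([0] * 30 + [1] * 30)
```

With 60 SNPs and a 30-SNP window there are 31 windows, and the smoothed value rises gradually from 0 to 1 across the boundary.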

Smoothing (sliding window = 30 SNPs)
[Figure: smoothed CN versus position (Mbp) on Chrom. 7; panels for Affymetrix and Illumina]

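Significance Test of CN Changes: An Example

Sliding Window Smoothing
[Plot: CN versus position (Mbp)]

Normalization
[Plots: CN versus position (Mbp); SD versus position (Mbp)]

P-value calculation
[Plots: SD versus position (Mbp); -log P versus position (Mbp)]

Calculate FDR for each window
[Plots: -log FDR versus position (Mbp); -log P versus position (Mbp)]

Select window (FDR < 0.05)
[Plots: CN versus position (Mbp); -log FDR versus position (Mbp)]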
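The steps of this example can be strung together as in the sketch below. Several details are assumptions, since the slides only name the steps: window values are converted to SD units with a z-score, p-values are two-sided under a standard normal, and the FDR is computed with the Benjamini-Hochberg procedure. All function names are hypothetical.

```python
import math
import statistics


def z_scores(window_cn):
    """Express each smoothed window CN in SD units."""
    mean = statistics.fmean(window_cn)
    sd = statistics.stdev(window_cn)
    return [(x - mean) / sd for x in window_cn]


def two_sided_p(z):
    """Two-sided p-value under a standard normal (assumed null)."""
    return math.erfc(abs(z) / math.sqrt(2))


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (one FDR value per window)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    fdr = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        fdr[i] = running_min
    return fdr


def significant_windows(window_cn, alpha=0.05):
    """Indices of windows with FDR < alpha."""
    p = [two_sided_p(z) for z in z_scores(window_cn)]
    return [k for k, q in enumerate(benjamini_hochberg(p)) if q < alpha]


print(significant_windows([0.0] * 99 + [3.0]))  # prints [99]
```

Note that a region covering a large share of the chromosome inflates the SD used for standardization, so this simple version is best suited to focal changes.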

Another Example
Intensities and Raw CNs, Chr. 1 (Pair #101)
Black: Normal, Red: Tumor, Green: Tumor - Normal

Significance Test for Copy Number Changes: -log(p) values, TSP data, Chr. 1, Pair #101
Window-based t test
Window size = 0.5 Mbp (~30 SNPs); N = SNP number in window
t = (Mean CN of window / SD of window) × √N ~ t (df = N - 1)
[Plot: -log(p) versus window position (Mbp)]
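The window statistic is a one-sample t test of the mean raw CN against 0. A minimal sketch, with `window_t` as a hypothetical helper name:

```python
import math
import statistics


def window_t(cn_in_window):
    """Return (t, df) for the raw CN values of the SNPs in one window.

    t = mean / (SD / sqrt(N)), tested against a true mean of 0, df = N - 1.
    """
    n = len(cn_in_window)
    mean = statistics.fmean(cn_in_window)
    sd = statistics.stdev(cn_in_window)
    return mean / sd * math.sqrt(n), n - 1


t, df = window_t([0.5, 1.5, 0.5, 1.5])
# The p-value then comes from a t distribution with df degrees of freedom,
# for example 2 * scipy.stats.t.sf(abs(t), df) if SciPy is available.
```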

Segmentation (break chrom. into CN-homogeneous pieces)
BioConductor R Packages (www.bioconductor.org)
- GLAD package, adaptive weights smoothing (AWS) method
- DNAcopy package, circular binary segmentation method
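To make the idea of segmentation concrete, here is a deliberately simplified recursive binary segmentation in Python. It is neither the AWS method of GLAD nor the circular binary segmentation of DNAcopy: it looks for the single split that best separates the means of the two sides and recurses while the split statistic exceeds a threshold. The threshold of 4.0, the minimum segment size, and the function names are arbitrary assumptions.

```python
import math
import statistics


def best_split(values, min_size):
    """Return (index, statistic) of the best single change point, or (None, 0.0)."""
    n = len(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return None, 0.0
    total = sum(values)
    best_i, best_stat = None, 0.0
    left_sum = sum(values[:min_size - 1])
    for i in range(min_size, n - min_size + 1):
        left_sum += values[i - 1]  # now the sum of values[:i]
        diff = left_sum / i - (total - left_sum) / (n - i)
        stat = abs(diff) / (sd * math.sqrt(1 / i + 1 / (n - i)))
        if stat > best_stat:
            best_i, best_stat = i, stat
    return best_i, best_stat


def segment(values, threshold=4.0, min_size=5, offset=0):
    """Return segments as (start, end, mean CN), with end exclusive."""
    i, stat = best_split(values, min_size)
    if i is not None and stat > threshold:
        return (segment(values[:i], threshold, min_size, offset)
                + segment(values[i:], threshold, min_size, offset + i))
    return [(offset, offset + len(values), statistics.fmean(values))]


print(segment([0.0] * 20 + [1.0] * 20))
# prints [(0, 20, 0.0), (20, 40, 1.0)]
```

A plain binary split like this can miss a short gain flanked by unaltered regions on both sides, because no single cut separates it well.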

CN Estimation: Hidden Markov Model (HMM)
Software: CNAT (www.affymetrix.com); dChip (www.dchip.org); CNAG (www.genome.umin.jp)
[Figure: chain over positions …, SNP_i, SNP_i+1, SNP_i+2, SNP_i+3, SNP_i+4, …; hidden status (unknown CN, CN=?) above observed status (raw CN = log ratio of intensities)]
CN estimation: finding the sequence of CN values that maximizes the likelihood of the observed raw CN.
Algorithm: Viterbi algorithm (can be iterative)
The following information/assumptions are needed:
- Background probabilities: overall probabilities of possible CN values. P(CN = x); x = 0, 1, 2, 3, 4, …, n (usually n < 10)
- Transition probabilities: probabilities of the CN value of each SNP conditional on the previous one. P(CN_i+1 = x | CN_i = y); x, y = 0, 1, 2, 3, 4, …, or n
- Emission probabilities: probabilities of the observed raw CN value of each SNP conditional on the hidden/unknown/true CN status. f(x | CN = y); x = observed log ratio (any real number); y = 0, 1, 2, 3, 4, …, or n

HMM Estimation of CN for Chr. 1 (Pair #101)
Black: Normal Intensities, Red: Tumor Intensities, Green: Tumor - Normal, Blue: HMM estimated CNs in Tumor Tissue
[Plot annotations: CN=2, CN=1, CN=4, CN=3]

Population Level Analysis
Analysis for the whole group (or sub-group) of samples
- Overall significance test
- Amplification and deletion frequencies summarization
- Common/concurrent region finding
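Two of these summaries are easy to sketch once every sample has a CN estimate per position. The example assumes integer CN calls stored as one list per sample, treats CN=2 as unaltered, and reports a concurrent region as a run of positions altered in at least half of the samples; the function names and the 0.5 cutoff are assumptions.

```python
def alteration_frequencies(cn_calls, normal_cn=2):
    """Per-position amplification and deletion frequencies across samples.

    cn_calls: one list of estimated CN values per sample, same length each.
    """
    n_samples = len(cn_calls)
    amp, deletion = [], []
    for column in zip(*cn_calls):
        amp.append(sum(c > normal_cn for c in column) / n_samples)
        deletion.append(sum(c < normal_cn for c in column) / n_samples)
    return amp, deletion


def concurrent_regions(freq, min_freq=0.5):
    """Runs of positions with frequency >= min_freq, as (start, end) with end exclusive.

    Assumes min_freq > 0.
    """
    regions, start = [], None
    for i, f in enumerate(list(freq) + [0.0]):  # sentinel closes a final run
        if f >= min_freq and start is None:
            start = i
        elif f < min_freq and start is not None:
            regions.append((start, i))
            start = None
    return regions


calls = [[2, 3, 1, 2], [2, 3, 2, 2], [2, 4, 1, 2], [2, 2, 1, 0]]
amp, deletion = alteration_frequencies(calls)
print(concurrent_regions(amp), concurrent_regions(deletion))
# prints [(1, 2)] [(2, 3)]
```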

Raw CN Changes of Chr. 14 (averaged over ~400 pairs)

Genome-wide Raw Copy Number Changes (sliding window plot, averaged over ~400 pairs)

Sliding Window Test of Significance of CN Changes: -log(p) values, based on ~400 pairs

Visualization of Concurrent Regions of Chr. 14 (~400 pairs)
[Axes: positions, samples]

Software
- Affymetrix Chips (www.affymetrix.com)
- Illumina Chips (www.illumina.com)
- CNAT (www.affymetrix.com); dChip (www.dchip.org); CNAG (www.genome.umin.jp)
- GenePattern (www.broad.mit.edu/cancer/software/genepattern/)
- BioConductor R Packages (www.bioconductor.org): GLAD package, adaptive weights smoothing (AWS) method; DNAcopy package, circular binary segmentation method
