Global Variation in Copy Number in the Human Genome Speaker: Yao-Ting Huang Nature, Genome Research, Genome Research, 2006.

1 Global Variation in Copy Number in the Human Genome Speaker: Yao-Ting Huang Nature, Genome Research, Genome Research, 2006.

2 2 References
Redon et al. Global variation in copy number in the human genome. Nature, 2006.
Fiegler et al. Accurate and reliable high-throughput detection of copy number variation in the human genome. Genome Research, 2006.
Komura et al. Genome-wide detection of human copy number variations using high-density DNA oligonucleotide arrays. Genome Research, 2006.
Komura et al. Noise reduction from genotyping microarrays using probe level information. In Silico Biology, 2006.
Price et al. SW-Array: a dynamic programming solution for the identification of copy-number changes in genomic DNA using array comparative genome hybridization data. NAR, 2005.

3 3 Copy Number Variation (CNV) A copy number variation (CNV) is a DNA segment at least 1 kb long that is present at a variable copy number compared with a reference genome. CNVs are speculated to arise from non-allelic homologous recombination. CNVs may disrupt genes, alter gene dosage, and confer risk to complex disease traits such as HIV-1 infection.

4 4 Examples of CNVs (1)
Duplication: paternal copy # = 2, maternal copy # = 2, offspring copy # = 3.
Deletion: paternal copy # = 2, maternal copy # = 2, offspring copy # = 1.
[Figure: example segment ATAATACATAATAC … shown for each family member.]

5 5 Examples of CNVs (2)
Paternal copy # = 3, maternal copy # = 3; offspring copy # = 2 or offspring copy # = 4.
Possible explanations: Mendelian inheritance, deletion, duplication.
It is hard to tell the actual type of a CNV even within a family.
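The ambiguity in the two family examples can be illustrated with a small enumeration. This is a minimal sketch, not part of the original slides: the function name `offspring_copy_numbers` and the way each parent's total is split across two haplotypes (for example 3 = 1 + 2) are assumptions made for illustration.

```python
from itertools import product

def offspring_copy_numbers(paternal, maternal):
    """Return every total copy number a child can inherit with no new mutation.

    Each parent is given as a pair (copies on haplotype 1, copies on haplotype 2).
    The child receives exactly one haplotype from each parent.
    """
    return {p + m for p, m in product(paternal, maternal)}

# Assumed split for the second example: each parent carries 3 copies as 1 + 2.
inherited = offspring_copy_numbers((1, 2), (1, 2))
print(sorted(inherited))

# First example: both parents carry 1 + 1, so only 2 copies can be inherited.
baseline = offspring_copy_numbers((1, 1), (1, 1))
print(sorted(baseline))
```

Under the assumed 1 + 2 split, offspring with 2, 3, or 4 copies are all consistent with plain Mendelian inheritance, so a child with 2 or 4 copies need not carry a new deletion or duplication. With 1 + 1 parents, any offspring count other than 2 does require a new event.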

6 6 Use of Two Array Platforms (1) Whole Genome TilePath array (93.7% of euchromatin); (2) Affymetrix 500K SNP array.

7 7 Results A total of 1,447 CNVs were identified after merging the calls from the two arrays: 913 CNVs from the tiling array and 980 CNVs from the SNP genotyping array. These CNVs cover 360 Mb (12%) of the human genome. The mean CNV size is 341 kb on the tiling array and 206 kb on the SNP array. The large-insert clones (~170 kb) used on the tiling array tend to overestimate CNV size.

8 8 Strengths and Weaknesses of the Two Arrays The 500K SNP array is better at detecting smaller CNVs. The tiling array has more power than the SNP array in segmentally duplicated regions.

9 9 Location of CNVs
Type of sequence                          WGTP CNVs   500K CNVs
RefSeq (~25,000 genes)                    2,561       1,139
OMIM (1,961 genes)                        251         112
Ultra-conserved elements (481 elements)   48          16
Conserved non-coding elements             116,678     59,397
CNVs are preferentially located outside of genes and ultra-conserved elements.

10 10 Other Results 48% of gaps in the human genome assembly are flanked or overlapped by CNVs. 24% of the 1,447 CNVs are associated with segmental duplications. A portion of segmental duplications are CNVs and thus are not fixed in the human genome. 12% of the 1,447 CNVs were validated by locus-specific quantitative assays (e.g., quantitative PCR).

11 11 Linkage Disequilibrium between bi-allelic CNVs and Tag SNPs
Linkage disequilibrium between bi-allelic CNVs and flanking SNPs can guide the selection of tag SNPs. E.g., the copy number of CNV 1 can be predicted by SNP 2. A single SNP array is sufficient to detect both SNPs and CNVs.
SNP 1   SNP 2   CNV 1        SNP 3   SNP 4
C       A       Copy # = 2   C       C
A       C       Copy # = 3   G       C
C       A       Copy # = 2   C       T
C       C       Copy # = 3   G       C
A       A       Copy # = 2   C       C

12 12 Linkage Disequilibrium between bi-allelic CNVs and Tag SNPs
Linkage disequilibrium between bi-allelic CNVs and flanking SNPs can guide the selection of tag SNPs. E.g., suppose SNP 2 is selected as the tag SNP.
SNP 1   SNP 2   CNV 1        SNP 3   SNP 4
C       A       Copy # = 2   C       C
A       C       Copy # = 3   G       C
C       A       Copy # = 2   C       T
C       C       Copy # = 3   G       C
A       A       Copy # = 2   C       C
SNP 2 = A: Copy # = 2
SNP 2 = C: Copy # = 3
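The tag SNP idea can be checked numerically on the five example haplotypes. The sketch below uses the standard two-locus r² statistic computed from haplotype frequencies; the slides do not say which statistic the authors used for bi-allelic CNVs, so that choice, along with the function name `r_squared`, is an assumption.

```python
# Haplotypes transcribed from the slide: (SNP 1, SNP 2, CNV 1 copy #, SNP 3, SNP 4).
haplotypes = [
    ("C", "A", 2, "C", "C"),
    ("A", "C", 3, "G", "C"),
    ("C", "A", 2, "C", "T"),
    ("C", "C", 3, "G", "C"),
    ("A", "A", 2, "C", "C"),
]

def r_squared(snp_alleles, copy_numbers):
    """Two-locus r^2 between a bi-allelic SNP and a bi-allelic CNV.

    r^2 = D^2 / (pA (1 - pA) pB (1 - pB)), with D = pAB - pA * pB.
    The first observed allele at each locus is taken as the reference;
    r^2 does not depend on that choice for bi-allelic loci.
    """
    n = len(snp_alleles)
    a, b = snp_alleles[0], copy_numbers[0]
    p_a = sum(x == a for x in snp_alleles) / n
    p_b = sum(y == b for y in copy_numbers) / n
    p_ab = sum(x == a and y == b for x, y in zip(snp_alleles, copy_numbers)) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # a monomorphic locus carries no LD information
    d = p_ab - p_a * p_b
    return d * d / denom

cnv = [h[2] for h in haplotypes]
snp_columns = {"SNP 1": 0, "SNP 2": 1, "SNP 3": 3, "SNP 4": 4}
ld = {
    name: round(r_squared([h[col] for h in haplotypes], cnv), 3)
    for name, col in snp_columns.items()
}
for name, value in ld.items():
    print(name, value)
```

On these five haplotypes SNP 2 reaches r² = 1, which is why it can serve as a tag for CNV 1. SNP 3 happens to tag the CNV equally well in this toy table, while SNP 1 and SNP 4 do not.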

13 13 Results of Linkage Disequilibrium around bi-allelic CNVs 51% of CNVs in non-African populations have tag SNPs, whereas only 22% of CNVs in the African population can be tagged. Duplications would generate linkage disequilibrium at the acceptor locus instead of the donor locus. The Phase I HapMap project has a paucity of SNPs in segmentally duplicated regions, where CNVs are enriched. Given false-positive CNV calls and the uncertainty of CNV boundaries, these results are biased (Conrad et al., Nat. Genet., 2006).

14 14 Linkage Disequilibrium around multi-allelic CNVs
Linkage disequilibrium between a multi-allelic CNV and each flanking SNP is computed as the square of Pearson's correlation coefficient. No SNPs with strong linkage disequilibrium are found. It is a mistake to compare a bi-allelic SNP with a multi-allelic CNV in this way.
SNP 1   CNV 1        SNP 2
C       Copy # = 0   C
A       Copy # = 1   G
C       Copy # = 2   C
C       Copy # = 3   G
A       Copy # = 1   C
SNP 1: C or A. Copy #: 0, 1, 2, or 3.
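The weakness of this statistic shows up directly on the slide's five example haplotypes. The following is a minimal sketch: the 0/1 allele coding (A = 1 for SNP 1, G = 1 for SNP 2) and the helper name `pearson_r2` are assumptions for illustration, not the authors' code.

```python
# Haplotypes transcribed from the slide: (SNP 1, CNV 1 copy #, SNP 2).
rows = [
    ("C", 0, "C"),
    ("A", 1, "G"),
    ("C", 2, "C"),
    ("C", 3, "G"),
    ("A", 1, "C"),
]

def pearson_r2(xs, ys):
    """Squared Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined when one variable is constant
    return cov * cov / (var_x * var_y)

copies = [r[1] for r in rows]
snp1 = [1 if r[0] == "A" else 0 for r in rows]  # assumed coding: A = 1, C = 0
snp2 = [1 if r[2] == "G" else 0 for r in rows]  # assumed coding: G = 1, C = 0

r2_snp1 = round(pearson_r2(snp1, copies), 3)
r2_snp2 = round(pearson_r2(snp2, copies), 3)
print(r2_snp1, r2_snp2)
```

Both values are low. In this table allele A of SNP 1 always sits next to copy number 1, yet that relationship is not linear in copy number (C appears with 0, 2, and 3 copies), so a linear correlation largely misses it.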

15 15 Lunch Break - Method Intensity preprocessing; CNV detection; copy number inference.

16 16 Intensity Preprocessing The signal intensity can be skewed by the length of the restriction enzyme fragment, the GC content of the probe sequence, the GC content of the restriction fragment, or affinity differences between SNP genotypes (e.g., AA, AC, CC). Probe selection, noise reduction, and normalization are done at this stage (Komura et al., In Silico Biology, 2006).

17 17 CNV Detection For each pair of samples, we can test the relative intensity ratio at each SNP position.
[Figure: log2 intensity ratio (y-axis, -2 to 2) against SNP position (x-axis); regions are labeled "no copy number change", "relative gain of copy", and "relative loss of copy".]
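The per-SNP comparison can be sketched as follows. The intensity values, the function name `classify_ratios`, and the cut-off of 0.5 are all hypothetical; the slides do not state the thresholds the authors used.

```python
from math import log2

def classify_ratios(sample_a, sample_b, threshold=0.5):
    """Label each SNP position by the log2 intensity ratio of sample a to sample b.

    threshold is an assumed cut-off on |log2 ratio|, chosen only for illustration.
    """
    calls = []
    for a, b in zip(sample_a, sample_b):
        ratio = log2(a / b)
        if ratio > threshold:
            label = "relative gain"
        elif ratio < -threshold:
            label = "relative loss"
        else:
            label = "no change"
        calls.append((ratio, label))
    return calls

# Hypothetical probe intensities at four SNP positions.
intensities_a = [1000, 1500, 1010, 500]
intensities_b = [1000, 1000, 1000, 1000]
calls = classify_ratios(intensities_a, intensities_b)
for position, (ratio, label) in enumerate(calls, start=1):
    print(position, round(ratio, 2), label)
```

A ratio of 0 means both samples carry the same number of copies at that SNP. Because the ratio is relative, a "gain" in sample a is indistinguishable from a "loss" in sample b at this stage.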

18 18 CNV Detection A CNV is detected by finding clusters of sufficiently high (or low) ratios.
[Figure: log2 intensity ratio (y-axis, -2 to 2) against SNP position (x-axis).]

19 19 CNV Detection The intensity ratios at all SNPs can be regarded as a sequence of real numbers. We seek a consecutive subsequence of maximum sum.
Log2 intensity ratio by SNP position: 0, 0.54, 1.21, 0.26, 2.34, …, 0, 0.1, -1.43, -0.2, …, -2.4, -2.6, -1.83

20 20 CNV Detection A dynamic programming algorithm called SW-Array is used to find the subsequence (NAR, 2005). This algorithm was proposed by Bentley in 1984.
Log2 intensity ratio at positions P1, P2, P3, P4, P5, …: 0, 0.54, 1.21, 0.26, 2.34, …, 0, 0.1, -1.43, -0.2, …, -2.4, -2.6, -1.83
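The maximum-sum consecutive subsequence can be found in a single pass. The sketch below shows only that generic linear-time scan; it is not the published SW-Array implementation, which includes further steps not described in these slides. The function name `max_sum_segment` and the ratio values are hypothetical.

```python
def max_sum_segment(values):
    """Return (best_sum, start, end) of the consecutive run with the largest sum.

    start is inclusive and end is exclusive, so values[start:end] is the run.
    """
    best_sum, best_start, best_end = values[0], 0, 1
    run_sum, run_start = 0.0, 0
    for i, v in enumerate(values):
        if run_sum <= 0:
            # A non-positive running sum cannot help; start a new run here.
            run_sum, run_start = v, i
        else:
            run_sum += v
        if run_sum > best_sum:
            best_sum, best_start, best_end = run_sum, run_start, i + 1
    return best_sum, best_start, best_end

# Hypothetical log2 intensity ratios along consecutive SNP positions.
ratios = [0.0, 0.1, -0.2, 1.2, 0.9, 1.4, -0.3, 0.1, -1.5]
best_sum, start, end = max_sum_segment(ratios)
print(round(best_sum, 2), ratios[start:end])
```

Running the same scan on the negated sequence finds the run of lowest ratios, which corresponds to a candidate relative loss.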

21 21 Copy Number Inference Each of these clusters implies a putative CNV, but we still do not know the exact copy number.
[Figure: log2 intensity ratio (y-axis, -2 to 2) against SNP position (x-axis).]

22 22 Pairwise Comparison for All Samples The above algorithm is repeated for each pair of samples. Sample a / Sample b

23 23 Copy Number Inference The largest group of samples with the same copy number is called the diploid group. This diploid group is used as a reference representing two copies. The authors assume that mutation events are rare, so two copies should be the most frequent state in the population.

24 24 Steps of Copy Number Inference

25 25 Copy Number Inference Samples c, d, and e are the largest group.

26 26 Copy Number Inference The copy numbers of samples a and b are inferred by comparing their intensity ratios with the average ratio of the diploid group.
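The inference steps can be sketched end to end. This is a simplified illustration under stated assumptions, not the authors' procedure: it works on one hypothetical summary intensity per sample instead of pairwise ratios, groups samples with a simple gap rule (`tolerance`, an assumed parameter), and converts a log2 ratio r against the diploid group into a copy number as 2 × 2^r.

```python
from math import log2

def infer_copy_numbers(intensities, tolerance=0.2):
    """Infer integer copy numbers for one CNV region.

    intensities maps sample name -> hypothetical summary intensity in the region.
    Samples whose sorted log2 intensities differ by at most `tolerance` are
    grouped together; the largest group is taken as the diploid reference.
    """
    logs = sorted((log2(value), name) for name, value in intensities.items())
    groups = [[logs[0]]]
    for previous, current in zip(logs, logs[1:]):
        if current[0] - previous[0] <= tolerance:
            groups[-1].append(current)
        else:
            groups.append([current])
    diploid = max(groups, key=len)  # assumed to carry two copies
    reference = sum(value for value, _ in diploid) / len(diploid)
    return {name: round(2 * 2 ** (value - reference)) for value, name in logs}

# Hypothetical intensities; samples c, d, and e form the largest group.
copy_numbers = infer_copy_numbers(
    {"a": 1500, "b": 510, "c": 1000, "d": 1020, "e": 980}
)
print(copy_numbers)
```

With these made-up values, sample a comes out at three copies and sample b at one, while c, d, and e stay at two. The result depends entirely on the assumption that the most common state in the population is two copies.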

27 27 Concluding Remarks The authors identify 1,447 CNVs using whole genome tiling and SNP genotyping arrays.  Given the low resolution of their arrays and flawed methods, I believe JJ’s results should be much more promising.  Linkage disequilibrium between CNVs and SNPs requires more sophisticated statistics and algorithms.

