1 Estimating chromosomal copy number InCoB2007, Hong Kong, 30 August, 2007.

2 Copy number variation (CNV): what is it? A form of human genetic variation: instead of 2 copies of each region of each chromosome (diploid), some people have amplifications or losses (> 1 kb) in different regions; this doesn’t include translocations or inversions. We all have such regions: the publicly available genome NA15510 has between 5 & 240 by various estimates, and they are only rarely harmful (but rare things do happen).

3 Copy number variation: population genomics. The genomes of two humans differ more in a structural sense than at the nucleotide level; a recent paper estimates that on average two of us differ by ~ Mb of genetic material due to Copy Number Variation and by ~ 2.5 Mb due to Single Nucleotide Polymorphisms.

4 Copy number variation as it relates to human disease. It is responsible for a number of rare genetic conditions, for example Down syndrome (trisomy 21) and Cri du chat syndrome (a partial deletion of 5p). It is implicated in complex diseases, for example: lower CCL3L1 CN is associated with higher HIV/AIDS susceptibility; also, some sporadic (non-inherited) CN variants are strongly associated with autism. Tumors typically have a lot of chromosomal abnormalities, including recurrent CN changes.

5 Trisomy 21

6 Partial deletion of chr 5p

7 Large amplifications/losses can be seen by eye; smaller ones are hard to see

8 A cytogeneticist’s story: “The story is about diagnosis of a 3 month old baby with macrocephaly and some heart problems. The doctors questioned a couple of syndromes which we tested for and found negative. Rather than continue this ‘shot in the dark’ approach, we put the case on an array and found a 2Mb deletion which notably deletes the gene NSD1 on chr 5, mutations in which are known to cause Sotos syndrome. This is an overgrowth syndrome and fits with the macrocephaly. The bottom line is that we are able to diagnose quicker by this approach and delineate exactly the underlying genetic change.”

9 2Mb deletion Chromosome 5

10 NSD1

11 Many tumors have gross CN changes. Example: a lung cancer cell line vs matched normal lymphoblast, from Nannya et al Cancer Res 2005;65:

12 Research into gonad dysfunction: human sex reversal. 20% of 46,XY females have mutations in SRY; 80% of 46,XY females unexplained! 90% of 46,XX males due to translocation of SRY; 10% of 46,XX males unexplained! This suggests that loss of function and gain of function mutations in other genes may cause sex reversal. We’re looking at shared deletions.

13 Plan: To introduce the Single Nucleotide Polymorphism (SNP) arrays, the probes, and the associated assays. Then I’ll discuss the first bioinformatic aspect of Copy Number (CN) analysis, what I call low-level analyses, then show one way of assessing the outcome. For simplicity I concentrate on Affymetrix arrays, called GeneChips, though similar considerations apply in whole or in part to some other array technologies, including Illumina.

14 Affymetrix SNP chip terminology. Genomic DNA: TAGCCATCGGTA [A/G] GTACTCAATGAT, where [A/G] is a SNP. Perfect Match probe for Allele A: ATCGGTAGCCATTCATGAGTTACTA. Perfect Match probe for Allele B: ATCGGTAGCCATCCATGAGTTACTA. Genotyping: answering the question about the two copies of the chromosome on which the SNP is located: is a sample AA (AA), AB (AG) or BB (GG) at this SNP?
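The relationship between the genomic sequence and the two perfect-match probes on this slide can be sketched in a few lines. This is only an illustration: the function names `complement` and `pm_probes` are hypothetical, and the probe is written as the base-by-base complement of the target (no reversal), which is how the slide lays the sequences out.

```python
# Complement table for DNA bases.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def complement(seq: str) -> str:
    """Base-by-base complement (no reversal), matching the slide's layout."""
    return "".join(COMPLEMENT[base] for base in seq)

def pm_probes(left_flank: str, alleles: tuple, right_flank: str) -> dict:
    """Build one perfect-match probe per allele (named A and B)."""
    probes = {}
    for name, base in zip(("A", "B"), alleles):
        target = left_flank + base + right_flank  # genomic target for this allele
        probes[name] = complement(target)         # probe pairs with the target
    return probes

# Flanks and alleles taken from the slide; the SNP sits at the probe centre.
probes = pm_probes("TAGCCATCGGTA", ("A", "G"), "GTACTCAATGAT")
print(probes["A"])  # ATCGGTAGCCATTCATGAGTTACTA
print(probes["B"])  # ATCGGTAGCCATCCATGAGTTACTA
```

The two probes differ only at the centre base (T versus C), which is what lets the array distinguish allele A from allele B.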

15 Affymetrix GeneChip: 1.28 cm × 1.28 cm; 6.4 million features/chip; feature size 5 µ; > 1 million identical 25 bp probes/feature.

16 GeneChip® Mapping Assay Overview: ng genomic DNA; RE digestion (Xba); adaptor ligation; PCR: one-primer amplification (complexity reduction); fragmentation and labeling; hyb & wash; genotypes AA, AB, BB.

17 Principal low-level analysis steps. Background adjustment and normalization at probe level: these steps are to remove lab/operator/reagent effects. Combining probe level summaries to a probe set level summary, best done robustly, on many chips at once: this is to remove probe affinity effects and discordant observations (gross errors, non-responding probes, etc). Possibly further rounds of normalization (probe set level), as lab/cohort/batch/other effects are frequently still visible. Derive the relevant copy-number quantities. Finally, quality assessment is an important low-level task.

18 Our preprocessing for total CN using SNP probe pairs (250K chip); the plot shows the AA, AT and TT genotype groups. Modification by H Bengtsson of a method due to A Wirapati, developed some years ago for microsatellite genotyping; similar to the approach used by Illumina.

19 Background adjustment and normalization Outcome similar to that achieved by quantile normalization
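For readers unfamiliar with the comparison method named on this slide, quantile normalization can be sketched as follows. This is a minimal illustration of plain quantile normalization, not of the allelic crosstalk calibration itself; the function name `quantile_normalize` and the toy intensities are made up, and ties are broken by position for simplicity.

```python
def quantile_normalize(arrays):
    """Force every chip to share the same empirical intensity distribution.

    arrays: list of equal-length lists of probe intensities, one per chip.
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Target distribution: mean of the k-th smallest value across chips.
    target = [sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)]
    normalized = []
    for a in arrays:
        # Indices of this chip's probes, ordered from smallest to largest signal.
        order = sorted(range(n), key=lambda idx: a[idx])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = target[rank]  # probe keeps its rank, takes the target value
        normalized.append(out)
    return normalized

chips = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]  # toy intensities, two chips
norm = quantile_normalize(chips)
print(norm)  # [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

After normalization both chips contain exactly the same set of values, and only the ordering of probes within each chip carries information.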

20 Low-level analysis problems haven’t been solved once and for all; why? The feature size keeps decreasing and so the # features/chip keeps increasing. Fewer and fewer features are used for a given measurement, allowing more measurements to be made using a single chip. These considerations all place more and more demands on the low-level analysis: to maintain the quality of existing measurements, and to obtain good new ones.

21 SNP probe tiling strategy. Central probe quartet, SNP (A/G) at position 0. Target: TAGCCATCGGTA N GTACTCAATGAT. The quartet consists of PM and MM probes for allele A and for allele B; all four share the flanks ATCGGTAGCCAT and CATGAGTTACTA and differ only in the centre base (T, C, G or A).

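22 SNP probe tiling strategy, 2. +4 offset probe quartet: the probe window is shifted so that the SNP (A/G) sits at position +4 relative to the probe centre. The quartet again consists of PM and MM probes for allele A and for allele B, of the form GTAGCCAT [T/C] CAT [G/C] AGTTACTAGTCG: the first bracketed base distinguishes the alleles, the second is the centre base that distinguishes PM from MM.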

23 SNP probe tiling strategy, 3. Each quartet is PM A, MM A, PM B, MM B; there is a central quartet and several offset quartets. This was repeated on the opposite strand, giving 56 probes for the 10K chip. The 100K chip had 40, chosen from offsets and strands by performance. The 5.0 chip had 8 well chosen probes/SNP; no MMs. The current 6.0 chip has just 6: 3 replicates of a PM A and 3 of a PM B. Also, there are a large # of unreplicated non-polymorphic probes for CN inference.

24 What comes next? Using SNP chips to identify change in total copy number (i.e. CN ≠ 2): outline a new method (CRMA); evaluate and compare it with other methods; make some closing remarks on further issues.

25 Copy-number estimation using Robust Multichip Analysis (CRMA). CRMA steps: Preprocessing (probe signals): allelic crosstalk (or quantile). Total CN: $PM = PM_A + PM_B$. Summarization (SNP signals $\theta$): log-additive, PM only. Post-processing: fragment-length (GC-content). Raw total CNs: $M_{ij} = \log_2(\theta_{ij}/\theta_{Rj})$, with R = reference, chip i, probe j. A few details are passed over. Ask me later if you care about them.
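The last step, turning summarized signals into raw total copy numbers as log2 ratios against a reference, can be sketched as below. The slide does not say how the reference is formed, so taking it as the across-chip median for each unit is an assumption made here for illustration; the name `raw_copy_numbers` and the toy signals are hypothetical.

```python
import math
from statistics import median

def raw_copy_numbers(theta):
    """theta[i][j]: summarized signal for chip i, unit j (all values > 0).

    Returns M[i][j] = log2(theta[i][j] / theta_ref[j]). The reference
    theta_ref[j] is the across-chip median, which is an assumption.
    """
    n_units = len(theta[0])
    theta_ref = [median(row[j] for row in theta) for j in range(n_units)]
    return [[math.log2(row[j] / theta_ref[j]) for j in range(n_units)]
            for row in theta]

# Toy signals for 3 chips and 2 units.
theta = [[1000.0, 800.0], [2000.0, 800.0], [1000.0, 400.0]]
M = raw_copy_numbers(theta)
print(M)  # [[0.0, 0.0], [1.0, 0.0], [0.0, -1.0]]
```

A value near 0 means the same signal as the reference, +1 a doubling and -1 a halving; for a diploid reference a single-copy loss would ideally show up as -1.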

26 CRMA, 1. Preprocessing (probe signals): allelic crosstalk (or quantile). Already briefly described.

27 CRMA, 2. Total CN: $PM = PM_A + PM_B$. That’s it!

28 CRMA, 3. Summarization (SNP signals $\theta$): log-additive, PM only. Model: $\log_2(PM_{ijk}) = \log_2\theta_{ij} + \log_2\phi_{jk} + \varepsilon_{ijk}$. Fit using rlm.
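To make the log-additive model concrete, here is a small sketch that fits, for a single SNP, a chip effect plus a probe effect to a chips × probes table of log2 PM signals. The slide fits the model with rlm; this sketch swaps in median polish, a different robust fitting procedure, purely because it needs only the standard library. The function name `median_polish` and the toy data are made up.

```python
from statistics import median

def median_polish(y, n_iter=10):
    """Fit y[i][k] ~ chip[i] + probe[k] by alternating median sweeps.

    y[i][k] is log2(PM) for chip i and probe k of one SNP.
    chip[i] plays the role of log2(theta), probe[k] of log2(phi).
    """
    n_chips, n_probes = len(y), len(y[0])
    resid = [row[:] for row in y]
    chip = [0.0] * n_chips
    probe = [0.0] * n_probes
    for _ in range(n_iter):
        for i in range(n_chips):            # sweep out row (chip) medians
            m = median(resid[i])
            chip[i] += m
            resid[i] = [r - m for r in resid[i]]
        for k in range(n_probes):           # sweep out column (probe) medians
            m = median(resid[i][k] for i in range(n_chips))
            probe[k] += m
            for i in range(n_chips):
                resid[i][k] -= m
    # Centre the probe effects so that the chip effects carry the overall level.
    shift = median(probe)
    probe = [p - shift for p in probe]
    chip = [c + shift for c in chip]
    return chip, probe, resid

# Toy data: chip effects 10, 11, 10; probe effects -1, 0, 1; one gross outlier.
y = [[9.0, 10.0, 15.0],
     [10.0, 11.0, 12.0],
     [9.0, 10.0, 11.0]]
chip, probe, resid = median_polish(y)
print(chip)         # [10.0, 11.0, 10.0]
print(probe)        # [-1.0, 0.0, 1.0]
print(resid[0][2])  # 4.0
```

The outlying probe value ends up in the residuals rather than distorting the chip effect, which is the point of fitting robustly on many chips at once.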

29 CRMA, 4a. Post-processing: fragment-length (GC-content), 100K. Longer fragments get less well amplified by PCR and so give weaker SNP signals.

30 CRMA, 4b. Post-processing: fragment-length (GC-content), 500K. Longer fragments get less well amplified by PCR and so give weaker SNP signals.

31 CRMA, 4c. Post-processing: fragment-length (GC-content), 500K. Longer fragments get less well amplified by PCR and so give weaker SNP signals.
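One simple way to picture the fragment-length correction is to estimate the typical log signal as a function of fragment length and subtract it. The slides do not give the actual fitting procedure, so the binned-median approach below, the function name `fragment_length_correct`, the bin width and the toy numbers are all assumptions for illustration only.

```python
from statistics import median

def fragment_length_correct(log_signal, frag_len, bin_width=200):
    """Remove the dependence of log2 signal on PCR fragment length.

    Subtracts the median log signal of each fragment-length bin and adds
    back the overall median so the signal scale is preserved.
    """
    overall = median(log_signal)
    bins = {}
    for s, length in zip(log_signal, frag_len):
        bins.setdefault(length // bin_width, []).append(s)
    bin_median = {b: median(values) for b, values in bins.items()}
    return [s - bin_median[length // bin_width] + overall
            for s, length in zip(log_signal, frag_len)]

# Toy data: SNPs on long fragments (900+ bp) are about 1 log2 unit weaker.
frag_len = [300, 350, 900, 950]
log_signal = [11.0, 11.2, 10.0, 10.2]
corrected = fragment_length_correct(log_signal, frag_len)
print([round(c, 2) for c in corrected])  # [10.5, 10.7, 10.5, 10.7]
```

After the correction, SNPs on short and long fragments sit at the same level, so the remaining variation is no longer driven by fragment length.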

32 CRMA, 5. Raw total CNs: $M_{ij} = \log_2(\theta_{ij}/\theta_{Rj})$. Care required with the number and nature of Reference samples used.

33 Summary comparison of 4 methods: CRMA, dChip (Li & Wong 2001), CNAG* (Nannya et al 2005), CNAT v4 (Affymetrix 2006).
Preprocessing (probe signals): CRMA: allelic crosstalk (quantile); dChip: quantile; CNAG*: scale; CNAT v4: quantile.
Total CN: CRMA: $PM = PM_A + PM_B$; dChip: $PM = PM_A + PM_B$, $MM = MM_A + MM_B$; CNAG*: $PM = PM_A + PM_B$.
Summarization (SNP signals $\theta$): CRMA: log-additive, PM only; dChip: multiplicative, PM-MM; CNAG*: $\theta = \theta_A + \theta_B$; CNAT v4: “log-additive”, PM-only, $\theta = \theta_A + \theta_B$.
Post-processing: CRMA, CNAG* and CNAT v4: fragment-length (GC-content).
Raw total CNs: $M_{ij} = \log_2(\theta_{ij}/\theta_{Rj})$.

34 Evaluation: how well can we differentiate between one and two copies? HapMap: Mapping 250K Nsp data. 30 males and 29 females (no children; one bad data set). Chromosome X is known: males (CN=1) & females (CN=2). 5,608 SNPs. Classification rule: $M_{ij} <$ threshold ⇒ $CN_{ij} = 1$, otherwise $CN_{ij} = 2$. Number of calls: 59 × 5,608 = 330,872.

35 Calling samples for SNP_A. # males: 30; # females: 29. Call rule: if $M_i <$ threshold, call a male. Calling a male male: # true positives: 30. Calling a female male: # false positives: 5. TP rate: 30/30 = 100%. FP rate: 5/29 = 17%. $M = \log_2(\theta/\theta_R)$.
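The call rule and the two rates on this slide amount to simple counting, as the sketch below shows. The function name `call_rates` and the M values are invented for illustration; they are not the HapMap data.

```python
def call_rates(m_males, m_females, threshold):
    """Call 'male' (CN = 1) when M < threshold; return (TP rate, FP rate)."""
    true_pos = sum(m < threshold for m in m_males)     # males called male
    false_pos = sum(m < threshold for m in m_females)  # females called male
    return true_pos / len(m_males), false_pos / len(m_females)

# Made-up log2 ratios for one SNP on chromosome X.
m_males = [-1.1, -0.9, -0.7, -0.4]
m_females = [-0.5, -0.1, 0.0, 0.2]
tp_rate, fp_rate = call_rates(m_males, m_females, threshold=-0.3)
print(tp_rate, fp_rate)  # 1.0 0.25
```

With the slide's counts the same arithmetic gives 30/30 = 100% and 5/29 ≈ 17%.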

36 Receiver Operating Characteristic (ROC) curve: TP rate plotted against FP rate, traced out as the threshold increases.
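Sweeping the threshold from low to high and recording the (FP rate, TP rate) pair at each step traces out the ROC curve. The sketch below uses the same rule as before (call male when M < threshold); the function name `roc_curve` and the values are hypothetical.

```python
import math

def roc_curve(m_males, m_females):
    """Return (FP rate, TP rate) points for increasing thresholds."""
    # Every observed value is a candidate threshold; +inf calls everyone male.
    thresholds = sorted(set(m_males) | set(m_females)) + [math.inf]
    points = []
    for t in thresholds:
        tp = sum(m < t for m in m_males) / len(m_males)
        fp = sum(m < t for m in m_females) / len(m_females)
        points.append((fp, tp))
    return points

# Made-up log2 ratios: males tend to be lower (one copy of chromosome X).
points = roc_curve([-1.0, -0.8, -0.2], [-0.5, 0.0, 0.1])
for fp, tp in points:
    print(f"FP rate {fp:.2f}  TP rate {tp:.2f}")
```

The curve starts at (0, 0), where nobody is called male, and ends at (1, 1), where everybody is; a good method rises steeply towards TP rate 1 while the FP rate is still small.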

37 Single-SNP comparison: a random SNP (ROC plot, TP rate vs FP rate).

38 A non-differentiating SNP

39 Distribution (density) of TP rates when controlling for FP rate (5,608 SNPs). TP rate: correctly calling males male. FP rate: 1.0% (incorrectly calling females male). CNAT: 10% of SNPs poor.
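Controlling the FP rate means choosing the threshold from the female (CN = 2) values and then reading off the TP rate among males. A minimal sketch, assuming the threshold is set at an order statistic of the female values; the function name `tp_rate_at_fp` and the data are hypothetical.

```python
import math

def tp_rate_at_fp(m_males, m_females, max_fp_rate=0.01):
    """Pick the largest threshold whose FP rate stays within max_fp_rate,
    then return (threshold, TP rate) under the rule 'male if M < threshold'."""
    females = sorted(m_females)
    allowed = int(max_fp_rate * len(females))  # females allowed below threshold
    # The (allowed + 1)-th smallest female value leaves at most `allowed`
    # female values strictly below it.
    threshold = females[allowed] if allowed < len(females) else math.inf
    tp_rate = sum(m < threshold for m in m_males) / len(m_males)
    return threshold, tp_rate

m_males = [-1.1, -0.9, -0.7, -0.4]
m_females = [-0.5, -0.1, 0.0, 0.2]
print(tp_rate_at_fp(m_males, m_females, max_fp_rate=0.25))  # (-0.1, 1.0)
print(tp_rate_at_fp(m_males, m_females, max_fp_rate=0.0))   # (-0.5, 0.75)
```

Tightening the allowed FP rate pushes the threshold down and costs true positives, which is the trade-off the density plot summarizes across SNPs.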

40 CRMA & dChip perform better for an average SNP (common threshold). Number of calls: 59 × 5,608 = 330,872. (ROC plot with a zoomed-in view.)

41 Average across R SNPs in non-overlapping windows, then compare with the threshold. Plot annotation: a false positive (or real?!?).

42 Better detection rate when averaging (with risk of missing short regions). Panels: R=1 (no averaging), R=2, R=3, R=4.
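Averaging over non-overlapping windows of R consecutive SNPs can be sketched as below. Dropping an incomplete final window is a choice made here for simplicity; the function name `window_means` and the values are made up.

```python
def window_means(m, r):
    """Average consecutive, non-overlapping windows of r values.

    An incomplete window at the end of the sequence is dropped.
    """
    return [sum(m[start:start + r]) / r
            for start in range(0, len(m) - r + 1, r)]

# Made-up raw copy numbers along a chromosome, in genomic order.
m = [-1.0, -0.8, 0.2, -0.2, -0.9, -1.1, 0.5]
averaged = window_means(m, r=2)
print([round(a, 2) for a in averaged])  # [-0.9, 0.0, -1.0]
```

Averaging reduces the noise in each value, which improves detection at a fixed FP rate, but a change shorter than R SNPs gets diluted by its neighbours; that is the risk of missing short regions mentioned on the slide.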

43 CRMA does a bit better than dChip. Control for FP rate: 1.0%. CRMA TP rates: R=1: 69.6%; R=2: 96.0%; R=3: 98.7%; R=4: 99.8%; …

44 Comparing methods by “resolution”: CRMA, dChip, CNAG*, CNAT, at FP rate 1%.

45 Several further bioinformatic issues. Estimating copy number: needs calibration data. Segmentation (of chromosomes into constant copy number regions): an HMM-like algorithm. Analysing family CN data: a different HMM. Incorporating non-polymorphic probes: independent HMM observations to be weighted and combined. Dealing with mixed normal-abnormal samples. Utilizing poor quality DNA samples. Estimating allele-specific copy number. …and more.

46 Some results using trios Data: one of seven trios, 250K, results from Jeremy Silver

47 Conclusions/comments. Using chromosome X permits us to: test how well a method detects deletions; compare methods; get a sense of resolution. We plan to do further tests with known CN changes to see how well this generalizes. We are working on some of the other issues mentioned. There is room for contributions from you!

48 Available in aroma.affymetrix ("google it"). “Infinite” number of arrays: 1 to 1,000s. Requirements: 1-2GB RAM. Arrays: SNP, exon, expression, (tiling). Dynamic HTML reports. Import/export to existing methods. Open source: R. Cross platform: Windows, Linux, Mac.

49 Acknowledgements: Henrik Bengtsson, UC Berkeley; Andrew Sinclair & Howard Slater, MCRI; Nusrat Rabbee, Genentech; Simon Cawley, Francois Collin & Srinka Ghosh, Affymetrix; Rafael Irizarry & Benilton Carvalho, Johns Hopkins; Nancy Zhang, Stanford; Jeremy Silver, WEHI.

50 Thank you!