Human Cancer Genome Project Computational Systems Biology of Cancer: (II)


1 Human Cancer Genome Project Computational Systems Biology of Cancer: (II)

2 Human Cancer Genome Project Bud Mishra, Professor of Computer Science, Mathematics and Cell Biology. Courant Institute, NYU School of Medicine, Tata Institute of Fundamental Research, and Mt. Sinai School of Medicine

3 Human Cancer Genome Project The New Synthesis [Diagram labels: Genome Evolution; Selection; perturbed pathways; micro-environment; epigenomics; transcriptomics; proteomic; metabolomics; signaling; genetic instability; Part-lists, Annotation, Ontologies; DNA, RNA, Protein; Transcription, Translation; Genotype, Phenotype]

4 Human Cancer Genome Project Is the Genomic View of Cancer Necessarily Accurate ? “If I said yes, that would then suggest that that might be the only place where it might be done which would not be accurate, necessarily accurate. It might also not be inaccurate, but I'm disinclined to mislead anyone.” –US Secretary of Defense, Mr. Donald Rumsfeld, Once again quoted completely out of context.

5 Human Cancer Genome Project Cancer Initiation and Progression Genomics (Mutations, Translocations, Amplifications, Deletions); Epigenomics (Hyper- & Hypo-Methylation); Transcriptomics (Alternate Splicing, μRNA); Proteomics (Synthesis, Post-Translational Modification, Degradation); Signaling. Proliferation, Motility, Immortality, Metastasis, Signaling

6 Human Cancer Genome Project Mishra’s Mystical 3M’s Rapid and accurate solutions –Bioinformatic, statistical, systems, and computational approaches. –Approaches that are scalable, agnostic to technologies, and widely applicable Promises, challenges and obstacles— Measure Mine Model

7 Human Cancer Genome Project “Measure” What we can quantify and what we cannot

8 Human Cancer Genome Project Microarray Analysis of Cancer Genome Representations are reproducible samplings of DNA populations in which the resulting DNA has a new format and reduced complexity. –We array probes derived from low-complexity representations of the normal genome. –We measure differences in gene copy number between samples ratiometrically. –Since representations have a lower nucleotide complexity than total genomic DNA, we obtain a stronger specific hybridization signal relative to non-specific signal and noise. [Figure: Normal DNA, Normal LCR; Tumor DNA, Tumor LCR; Label; Hybridize]
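
The ratiometric measurement can be sketched in a few lines. This is a minimal illustration rather than the project's pipeline: the function name `log2_ratios`, the toy intensities, and the median-centering step (a common normalization choice) are all assumptions made for the example.

```python
import math
import statistics

def log2_ratios(tumor, normal):
    """Per-probe log2(tumor / normal), median-centered so that the
    typical (presumed unchanged) probe sits at 0."""
    raw = [math.log2(t / n) for t, n in zip(tumor, normal)]
    center = statistics.median(raw)
    return [r - center for r in raw]

# Hypothetical intensities for 7 probes (0-indexed):
# probes 2 and 3 look amplified, probe 5 looks deleted.
normal = [100.0, 120.0, 90.0, 110.0, 100.0, 80.0, 105.0]
tumor = [100.0, 120.0, 180.0, 220.0, 100.0, 40.0, 105.0]
ratios = log2_ratios(tumor, normal)
```

A doubling of copy number shows up as a log ratio near +1 and a halving as a log ratio near -1, which is why the later slides plot log ratios.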

9 Human Cancer Genome Project Minimizing Cross Hybridization (Complexity Reduction)

10 Human Cancer Genome Project A1 A2 A3 B1 B2 B3 C1 C2 C3 Copy Number Fluctuation

11 Human Cancer Genome Project Critical Innovations Data Normalization and Background Correction for Affy-Chips –10K, 100K, 500K (Affy); Generalized RMA –Multi-Experiment-Based Probe-Characterization (Kalman + EM) A novel genome segmenter algorithm –Empirical Bayes Approach; Maximum A Posteriori (MAP) –Generative Model (Hierarchical, Heteroskedastic) –Dynamic Programming Solution Cubic-Time; Linear-time Approximation using Beam-Search Heuristic Single Molecule Technologies –Optical and Nanotechnologies –Sequencing: SMASH –Epigenomics –Transcriptomics

12 Human Cancer Genome Project Background Correction & Normalization

13 Human Cancer Genome Project Oligo Arrays: SNP genotyping Given 500K human SNPs to be measured, select 25-mers that overlap each SNP location for Allele A. –Select another set of 25-mers corresponding to SNP Allele B. –Problem: Cross Hybridization. [Figure: DNA 25-mers]

14 Human Cancer Genome Project Using SNP arrays to detect Genomic Aberrations Each SNP “probeset” measures the absence/presence of one of two alleles. If a region of DNA is deleted by cancer, one or both alleles will be missing! If a region of DNA is duplicated/amplified by cancer, one or both alleles will be amplified. Problem: Oligo arrays are noisy.

15 Human Cancer Genome Project 90 humans, 1 SNP (A=0.48) Allele A Allele B

16 Human Cancer Genome Project 90 humans, 1 SNP (A=0.24) Allele B Allele A

17 Human Cancer Genome Project 90 humans, 1 SNP (A=0.96) Allele B Allele A

18 Human Cancer Genome Project Background Correction & Normalization Consider a genomic location $L$ and two “similar” nucleotide sequences $s_{L,x}$ and $s_{L,y}$ starting at that location in the two copies of a diploid genome. –E.g., they may differ in one SNP. –Let $\theta_x$ and $\theta_y$ be their respective copy numbers in the whole genome, and assume all copies are selected in the reduced-complexity representation. The gene chip contains four probes: $p_x \in s_{L,x}$; $p_y \in s_{L,y}$; $p_{x'}, p_{y'} \notin G$. –After PCR amplification, we have some $K_x \cdot \theta_x$ amount of DNA that is complementary to the probe $p_x$, etc., and $K'$ ($\approx K'_x$) amount of DNA that is additionally approximately complementary to the probe $p_x$.

19 Human Cancer Genome Project Normalize using a Generalized RMA I’ = U -  n – [   n 2 -  N(0,1) (a’/b’)/  N(0,1) (a’/b’)] £ {(1 +  ’ B  n /  N(0,1) (a’/b’)} -1 + [b  n /B  n ] )] £ {(1 +  N(0,1) (a’/b’)/(  ’ B  n )} -1, –Where a’ = U-  n -   n 2 ; b’ =  n, and –b  n =  [I i,j – U +  n ]  N(0,1) ([I i,j – U +  n ] ) –B  n =   N(0,1) ([I i,j – U +  n ] )

20 Human Cancer Genome Project Background Correction & Normalization If the probe has an affinity $\phi_x$ (and $\theta_x$ is the copy number), then the measured intensity can be expressed as $[K_x \theta_x + K'] \phi_x + \text{noise} = [\theta_x + K'/K_x] \phi'_x + \text{noise}$ –With $\exp(\mu_1 + \sigma_1 \epsilon_1)$ a multiplicative log-normal noise, $\mu_2 + \sigma_2 \epsilon_2$ an additive Gaussian noise, and $\phi'_x = K_x \phi_x$ an amplified affinity. A more general model: $I_x = [\theta_x + K'/K_x] \phi'_x e^{\mu_1 + \sigma_1 \epsilon_1} + \mu_2 + \sigma_2 \epsilon_2$

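21 Human Cancer Genome Project Mathematical Model In particular, we have four values of measured intensities (copy number $\theta$, amplified affinity $\phi'$, cross-hybridization term $N$, multiplicative log-normal noise and additive Gaussian noise): $I_x = [\theta_x \phi'_x + N_x] e^{\mu_1 + \sigma_1 \epsilon_1} + \mu_2 + \sigma_2 \epsilon_2$; $I_{x'} = [N_x] e^{\mu_1 + \sigma_1 \epsilon_1} + \mu_2 + \sigma_2 \epsilon_2$; $I_y = [\theta_y \phi'_y + N_y] e^{\mu_1 + \sigma_1 \epsilon_1} + \mu_2 + \sigma_2 \epsilon_2$; $I_{y'} = [N_y] e^{\mu_1 + \sigma_1 \epsilon_1} + \mu_2 + \sigma_2 \epsilon_2$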
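
The four-intensity model lends itself to a quick simulation. The sketch below is illustrative only: the function name, all parameter values, and the names `mu1`, `s1` (multiplicative log-normal term) and `mu2`, `s2` (additive Gaussian term) are assumptions, not values from the slides. It draws intensities for a matched probe and for its cross-hybridization-only partner, then checks that their mean difference recovers copy number times amplified affinity, up to the mean of the log-normal factor.

```python
import math
import random

def simulate_intensity(signal, cross_hyb, mu1=0.0, s1=0.1, mu2=0.0, s2=5.0, rng=random):
    """One draw of I = (signal + cross_hyb) * exp(mu1 + s1*e1) + mu2 + s2*e2,
    with e1, e2 independent standard normal variates."""
    mult = math.exp(mu1 + s1 * rng.gauss(0.0, 1.0))  # multiplicative, log-normal
    add = mu2 + s2 * rng.gauss(0.0, 1.0)             # additive, Gaussian
    return (signal + cross_hyb) * mult + add

rng = random.Random(0)
theta, phi_amp, n_x = 2.0, 150.0, 40.0  # hypothetical copy number, amplified affinity, cross-hyb level
draws = 20000
i_x = [simulate_intensity(theta * phi_amp, n_x, rng=rng) for _ in range(draws)]
i_x_prime = [simulate_intensity(0.0, n_x, rng=rng) for _ in range(draws)]

# E[exp(s1 * e)] = exp(s1**2 / 2), so the mean difference should be close to
# theta * phi_amp * exp(0.005).
mean_diff = sum(i_x) / draws - sum(i_x_prime) / draws
expected = theta * phi_amp * math.exp(0.1 ** 2 / 2)
```

Averaging over many draws is only a device for checking the model; on a real chip each probe is measured once per sample, which is why the later slides estimate the parameters across many probes and experiments.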

22 Human Cancer Genome Project Bioinformatics: Data modeling Good news: For each 25-bp probe, the fluorescent signal increases linearly with the amount of complementary DNA in the sample (up to some limit where it saturates). Bad news: The linear scaling and offset differ for each 25-bp probe. Scaling varies by factors of more than 10x. Noise : Due to PCR & cross hybridization and measurement noise.

23 Human Cancer Genome Project Scaling & Offset differ Scaling varies across probes: –Each 25-bp sequence has different thermodynamic properties. Scaling varies across samples: –The scanning laser for different samples may have different levels. –The starting DNA concentrations may differ; PCR may amplify differently. Offset varies across probes: –Different levels of Cross Hybridization with the rest of the Genome. Offset varies across samples: –Different sample genomes may differ slightly (sample degradation; impurities, etc.)
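
To make the per-probe scaling and offset concrete, here is a minimal sketch that fits a straight line per probe by ordinary least squares. It assumes, purely for illustration, that a few calibration samples with known copy number are available; the function name `fit_probe` and the numbers are hypothetical, and this is not the estimation procedure used in the talk.

```python
def fit_probe(copy_numbers, intensities):
    """Ordinary least-squares fit of intensity = offset + scale * copy_number."""
    n = len(copy_numbers)
    mean_c = sum(copy_numbers) / n
    mean_i = sum(intensities) / n
    sxx = sum((c - mean_c) ** 2 for c in copy_numbers)
    sxy = sum((c - mean_c) * (i - mean_i) for c, i in zip(copy_numbers, intensities))
    scale = sxy / sxx
    offset = mean_i - scale * mean_c
    return scale, offset

# Two hypothetical probes measured on the same four calibration samples.
copies = [1, 2, 3, 4]
probe_a = [150.0, 250.0, 350.0, 450.0]      # weak probe, small offset
probe_b = [1600.0, 2800.0, 4000.0, 5200.0]  # strong probe, larger offset

scale_a, offset_a = fit_probe(copies, probe_a)
scale_b, offset_b = fit_probe(copies, probe_b)
```

Here the two probes respond to the same DNA with slopes that differ by more than a factor of ten, so raw intensities cannot be compared across probes until each is rescaled by its own fitted line.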

24 Human Cancer Genome Project Linear Model + Noise

25 Human Cancer Genome Project Noise minimization

26 Human Cancer Genome Project Final Data Model

27 Human Cancer Genome Project MLE using gradients

28 Human Cancer Genome Project Data Outliers Our data model fails for a few data points (“bad probes”) –Soln (1): Improve the model… –Soln (2): Discard the outliers –Soln (3): Alternate model for the outliers… Weight the data appropriately.

29 Human Cancer Genome Project Outlier Model

30 Human Cancer Genome Project Problem with MLE: No unique maxima

31 Human Cancer Genome Project Scaling of MLE estimate

32 Human Cancer Genome Project Segmentation to reduce noise The true copy number (Allele A+B) is normally 2 and does not vary across the genome, except at a few locations (breakpoints). Segmentation can be used to estimate the location of breakpoints and then we can average all estimated copy number values between each pair of breakpoints to reduce noise.
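
The averaging step is simple once the breakpoints are known. A minimal sketch, assuming breakpoints are given as the start indices of new segments (the function name and the data are made up for the example):

```python
def smooth_by_segments(values, breakpoints):
    """Replace every value by the mean of the segment it falls in.
    `breakpoints` holds the index at which each new segment starts."""
    edges = [0] + sorted(breakpoints) + [len(values)]
    smoothed = []
    for start, end in zip(edges, edges[1:]):
        segment = values[start:end]
        if segment:
            mean = sum(segment) / len(segment)
            smoothed.extend([mean] * len(segment))
    return smoothed

# Noisy copy-number estimates: normal (2), a gained region (3), normal again.
estimates = [2.1, 1.9, 2.0, 3.2, 2.8, 3.0, 2.0, 2.0]
smoothed = smooth_by_segments(estimates, breakpoints=[3, 6])
```

Averaging m probes within a segment reduces the noise standard deviation by roughly a factor of sqrt(m), provided the breakpoints are placed correctly; locating them is the hard part and is the subject of the following slides.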

33 Human Cancer Genome Project Allelic Frequencies: Cancer & Normal

34 Human Cancer Genome Project Allelic Frequencies: Cancer & Normal

35 Human Cancer Genome Project Segmentation & Break-Point Detection

36 Human Cancer Genome Project Algorithmic Approaches Local Approach –Change-point Detection (QSum, KS-Test, Permutation Test) Global Approach –HMM models –Wavelet Decomposition Bayesian & Empirical Bayes Approach –Generative Models (One- or Multi-level Hierarchical) –Maximum A Posteriori

37 Human Cancer Genome Project HMM Model with a very high degree of freedom, but not enough data points. Small-sample statistics ⇒ overfitting, convergence to local maxima, etc.

38 Human Cancer Genome Project HMM, finally… Model with a very high degree of freedom, but not enough data points. Small-sample statistics ⇒ overfitting, convergence to local maxima, etc. [State diagram: copy-number states 2, ≥ 3, ≤ 1]

39 Human Cancer Genome Project HMM, last time We will simply model the number of break-points by a Poisson process, and the lengths of the aberrational segments by an exponential process. Two-parameter model: $p_b$ & $p_e$. [State diagram: states “= 2” and “≠ 2”, with transitions labeled $p_b$, $1-p_b$, $1-p_e$, $p_e$] Advantages: 1. Small number of parameters. Can be optimized by a MAP estimator (EM has difficulties). 2. Easy to model deviation from Markovian properties (e.g., polymorphisms, power-law, Polya’s urn-like process, local properties of chromosomes, etc.)

40 Human Cancer Genome Project Generative Model [Figure labels: Amplification, c=4; Amplification, c=3; Deletion, c=0; Deletion, c=1] Breakpoints: Poisson, $p_b$. Segmental Length: Exponential, $p_e$. Copy number: Empirical Distribution. Noise: Gaussian, $\mu$, $\sigma$.
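
A generative model of this kind can be simulated directly. The sketch below makes several simplifying assumptions that are not in the slides: a breakpoint occurs independently before each probe with probability `p_b` (a discrete stand-in for the Poisson process, giving geometric segment lengths), each new segment is normal (copy number 2) with probability `p_e`, and aberrant copy numbers are drawn uniformly from a small hypothetical set rather than from an empirical distribution.

```python
import random

def simulate_profile(n, p_b, p_e, sigma, rng, aberrant=(0, 1, 3, 4)):
    """Return (true_copy_numbers, noisy_observations) for n probes."""
    true_copy = []
    current = 2
    for i in range(n):
        # Start a new segment at the first probe or at a random breakpoint.
        if i == 0 or rng.random() < p_b:
            current = 2 if rng.random() < p_e else rng.choice(aberrant)
        true_copy.append(current)
    observed = [c + rng.gauss(0.0, sigma) for c in true_copy]
    return true_copy, observed

rng = random.Random(1)
true_copy, observed = simulate_profile(n=500, p_b=0.01, p_e=0.55, sigma=0.2, rng=rng)
```

Simulated profiles like this, where the true breakpoints are known, are what make the later ROC-style comparison of segmentation algorithms possible.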

41 Human Cancer Genome Project A reasonable choice of priors yields good segmentation.

42 Human Cancer Genome Project A reasonable choice of priors yields good segmentation.

43 Human Cancer Genome Project A MAP (Maximum A Posteriori) Estimator Priors: –Deletion + Amplification Data: –Priors + Noise Goal: Find the most plausible hypothesis of regional changes and their associated copy numbers. Generalizes HMM: the prior depends on two parameters, $p_e$ and $p_b$. –$p_e$ is the probability of a particular probe being “normal”. –$p_b$ is the average number of intervals per unit length. [Figure: $(p_e, p_b)$ max at (0.55, 0.01)]

44 Human Cancer Genome Project Likelihood Function The likelihood function for the first $n$ probes: $L(\langle i_1, \mu_1, \ldots, i_k, \mu_k \rangle) = e^{-p_b n} (p_b n)^k \times (2\pi\sigma^2)^{-n/2} \prod_{i=1}^{n} \exp[-(v_i - \mu_j)^2 / 2\sigma^2] \times p_e^{\#\mathrm{global}} (1-p_e)^{\#\mathrm{local}}$ –Where $i_k = n$ and probe $i$ belongs to the $j$-th interval. –The Maximum A Posteriori algorithm (implemented as a Dynamic Programming Solution) optimizes $L$ to get the best segmentation $L(\langle i^*_1, \mu_1, \ldots, i^*_k, \mu_k \rangle)$
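
As a rough sketch, the negative logarithm of this likelihood can be evaluated for a candidate segmentation as follows. The function name and argument layout are hypothetical, and one interpretation is assumed: an interval counts as "global" when its mean equals the global (normal) mean, and #global and #local count the probes in such intervals.

```python
import math

def neg_log_likelihood(values, ends, means, p_b, p_e, sigma, global_mean=2.0):
    """-log L for a segmentation given by exclusive end indices `ends`
    (the last one must equal len(values)) and one mean per interval."""
    n, k = len(values), len(ends)
    if ends[-1] != n or len(means) != k:
        raise ValueError("segmentation must cover all probes, one mean per interval")
    nll = p_b * n - k * math.log(p_b * n)              # breakpoint prior
    nll += 0.5 * n * math.log(2 * math.pi * sigma ** 2)  # Gaussian normalizer
    start = 0
    for end, mu in zip(ends, means):
        length = end - start
        nll += sum((v - mu) ** 2 for v in values[start:end]) / (2 * sigma ** 2)
        # Each probe contributes p_e if its interval is "global", else 1 - p_e.
        nll -= length * math.log(p_e if mu == global_mean else 1 - p_e)
        start = end
    return nll

values = [2.0] * 5 + [3.0] * 5
two_intervals = neg_log_likelihood(values, [5, 10], [2.0, 3.0], p_b=0.01, p_e=0.55, sigma=0.2)
one_interval = neg_log_likelihood(values, [10], [2.0], p_b=0.01, p_e=0.55, sigma=0.2)
```

On this toy profile the two-interval hypothesis has the lower negative log-likelihood: the extra breakpoint costs a little under the prior but removes a large residual.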

45 Human Cancer Genome Project Dynamic Programming Algorithm Generalizes and extends Viterbi. Uses the optimal parameters for the generative model. Adds a new interval to the end: $\langle i_1, \mu_1, \ldots, i_k, \mu_k \rangle \circ \langle i_{k+1}, \mu_{k+1} \rangle = \langle i_1, \mu_1, \ldots, i_k, \mu_k, i_{k+1}, \mu_{k+1} \rangle$ Incremental computation of the likelihood function: $-\log L(\langle i_1, \mu_1, \ldots, i_k, \mu_k, i_{k+1}, \mu_{k+1} \rangle) = -\log L(\langle i_1, \mu_1, \ldots, i_k, \mu_k \rangle) + \text{new-res.}/2\sigma^2 - \log(p_b n) + \tfrac{1}{2}(i_{k+1} - i_k) \log(2\pi\sigma^2) - (i_{k+1} - i_k) [I_{\mathrm{global}} \log p_e + I_{\mathrm{local}} \log(1 - p_e)]$
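
The incremental update suggests a simple dynamic program over interval end points. The following is a simplified quadratic-time sketch, not the algorithm from the talk: it assumes each interval's mean is either the global mean or the interval's own sample mean, it drops the terms that are identical for every segmentation, and the function name and data are made up.

```python
import math

def map_segment(values, p_b, p_e, sigma, global_mean=2.0):
    """Return [(start, end, mean), ...] minimizing the per-interval costs."""
    n = len(values)
    s1, s2 = [0.0], [0.0]  # prefix sums of v and v*v for O(1) interval statistics
    for v in values:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)
    penalty = -math.log(p_b * n)  # cost of each interval (positive when p_b * n < 1)

    def cost(a, b):
        length, tot, sq = b - a, s1[b] - s1[a], s2[b] - s2[a]
        mean = tot / length
        ss_local = sq - tot * tot / length
        ss_global = sq - 2 * global_mean * tot + length * global_mean ** 2
        c_local = ss_local / (2 * sigma ** 2) - length * math.log(1 - p_e)
        c_global = ss_global / (2 * sigma ** 2) - length * math.log(p_e)
        return (c_global, global_mean) if c_global <= c_local else (c_local, mean)

    best = [0.0] + [math.inf] * n   # best[j]: minimal cost of segmenting values[:j]
    back = [None] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            c, mu = cost(i, j)
            if best[i] + c + penalty < best[j]:
                best[j], back[j] = best[i] + c + penalty, (i, mu)
    segments, j = [], n
    while j > 0:                    # walk the back-pointers from the end
        i, mu = back[j]
        segments.append((i, j, mu))
        j = i
    return segments[::-1]

values = [2.0, 2.1, 1.9, 2.0, 2.05, 1.95, 3.0, 3.1, 2.9, 3.0, 3.05, 2.95, 2.0, 2.1, 1.9, 2.0]
segments = map_segment(values, p_b=0.01, p_e=0.55, sigma=0.1)
```

On this toy profile the recovered intervals are probes 0 to 5 (normal), 6 to 11 (gain, mean near 3) and 12 to 15 (normal). Restricting the candidate start points kept for each end point, as a beam search would, is one way such a scheme could be made faster.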

46 Human Cancer Genome Project Prior Selection: F criterion For each break we have a $T^2$ statistic and the appropriate tail probability ($p$-value) calculated from the distribution of the statistic. In this case, this is an F distribution. The best $(p_e, p_b)$ is the one that leads to the maximum min $p$-value. [Figure: $(p_e, p_b)$ max at (0.55, 0.01)]

47 Human Cancer Genome Project Segmentation Analysis

48 Human Cancer Genome Project Comparison of a chromosome 13 tumor profile (13q13.1 to 13q31.3) using 4 different segmentation algorithms: CGH Explorer v.2.43 (Lingjaerde, OC et al. Bioinformatics 21), DNAcopy (Olshen, AB et al. Biostatistics 5), GLAD (Hupe, P et al. Bioinformatics 20), vMAP (Daruwala et al. Proc Natl Acad Sci U S A)

49 Human Cancer Genome Project Comparative Analysis: BAC Array

50 Human Cancer Genome Project Comparative Analysis: Nimblegen

51 Human Cancer Genome Project Comparative Analysis: Affy 10K

52 Human Cancer Genome Project Simulated Data Array CGH simulations and an “ROC analysis” –Using the same scheme as Lai et al.: Weil R. Lai, Mark D. Johnson, Raju Kucherlapati, and Peter J. Park (2005), “Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data,” Bioinformatics, 21(19): –Segmented by Vmap and DNAcopy. –The Vmap algorithm was tested at 11 segmentation p-values: 0.1, , $10^{-2}$, $10^{-3}$, $10^{-4}$, …, –The DNAcopy algorithm was tested at 9 segmentation alpha values: 0.9, 0.5, 0.1, $10^{-2}$, $10^{-3}$, $10^{-4}$, …, Analysis by Alex Pearlman et al. (2006)

53 Human Cancer Genome Project VMAP

54 DNACopy

56 Prostate Tumor Gains and Losses: Genome view of 19K BAC CGH [Figure; y-axis: log ratio]

57 Human Cancer Genome Project Segmentation of Multi-BAC Events On Chromosome 13 [Figure panels: Normal 1,2,3; Tumor1; Tumor2; Tumor3] Proximal breakpoints were identical for T1 and T3. Distal breakpoints overlapped for T1, T2, and T3.

58 Human Cancer Genome Project Further Improvement We employed a hierarchical Bayesian model in which global false discovery rates can be calculated using the different levels of the model. Noise processes are also estimated using the appropriate global parameters.

59 Human Cancer Genome Project Specific Features of the Model We build a model in which, given the region segmentation, we assume that the copy numbers in each region $I_j$ ($1 \le j \le k$) are mutually independent Gaussian random variables $X_{i,j} \sim N(\mu_j, \sigma_j^2)$ ($1 \le i \le n_j$) with mean $\mu_j$ and variance $\sigma_j^2$. We further assume that each region mean parameter $\mu_j$ is in one of a small number of ‘states’ $s \in \{1, \ldots, S\}$, with respective probabilities $\pi_1, \ldots, \pi_S$ of being in state $s$. $\mu_j$ is in state $s$ (with probability $\pi_s$) if it has a Gaussian distribution with state mean $\nu_s$ and state variance $\tau_s^2$. States serve to characterize regions. The state means and variances are the hyperparameters of the model.
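
The two-level structure (states, then region means, then probe values) can be illustrated by sampling from it. This is a sketch under assumptions: the function and parameter names are invented, the state values are hypothetical, and a single within-region noise level `noise_sd` is used for all regions for brevity, whereas the model above allows a variance per region.

```python
import random

def sample_regions(region_sizes, state_probs, state_means, state_sds, noise_sd, rng):
    """For each region: pick a state, draw the region mean from that state's
    Gaussian, then draw the probe values around the region mean."""
    regions = []
    for n_j in region_sizes:
        s = rng.choices(range(len(state_probs)), weights=state_probs)[0]
        mu_j = rng.gauss(state_means[s], state_sds[s])
        x = [rng.gauss(mu_j, noise_sd) for _ in range(n_j)]
        regions.append((s, mu_j, x))
    return regions

rng = random.Random(7)
# Three hypothetical states: loss, normal, gain.
regions = sample_regions(
    region_sizes=[30, 12, 50],
    state_probs=[0.2, 0.6, 0.2],
    state_means=[1.0, 2.0, 3.0],
    state_sds=[0.1, 0.05, 0.2],
    noise_sd=0.25,
    rng=rng,
)
```

Because region means are tied together through a small number of states, evidence from all regions contributes to the estimates of the state means and variances.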

60 Human Cancer Genome Project Implementation: Dynamic Programming Given the hyperparameters, we segment regions using a dynamic programming approach. This consists of constructing probe regions as follows: –After the $(j-1)$-st region has been constructed: A) we choose the next two contiguous regions to the right of those already constructed by optimizing the corresponding log likelihood, subject to the condition that the p-value of the t-statistic distinguishing between these two regions is above a given threshold. B) Having chosen these regions, the probe regions already constructed, contiguous to them, may also need to be altered.

61 Human Cancer Genome Project Segmentation (ROMA,chr3)

62 Human Cancer Genome Project S*M*A*S*H Single Molecule Approaches to Sequencing by Hybridization ~Extensions to Optical Mapping~

63 Human Cancer Genome Project S*M*A*S*H Genomic DNA is carefully extracted from a small number of cells of an organism (e.g., human) in normal or diseased states. (Fig 1 shows a cancer cell to be studied for its oncogenomic characterization.) Fig 1

64 Human Cancer Genome Project S*M*A*S*H LNA probes of length 6 – 8 nucleotides are hybridized to dsDNA (double-stranded genomic DNA) in a test tube (Fig 2) and the modified DNA is stretched on a 1” x 1” chip that has microfluidic channels manufactured on its surface. These surfaces have been chemically treated to create a positive charge. Fig 2 DNA samples are prepared for analysis with LNA probes and restriction enzymes.

65 Human Cancer Genome Project S*M*A*S*H Since DNA is slightly negatively charged, it adheres to the surface as it flows along these channels and stretches out. Individual molecules range in size from 0.3 – 3 million base pairs in length. Next, bright emitters are attached to the probes on the surface and the molecules are imaged (Fig 3). Fig 3

66 Human Cancer Genome Project S*M*A*S*H A restriction enzyme (a highly specific molecular scissor that recognizes short nucleotide sequences and cuts the DNA at only those recognition sites) is added to break the DNA at specific sites. Since DNA molecules are under slight tension, the cut fragments of DNA relax like entropic springs, leaving small visible gaps corresponding to the positions of the restriction sites (Fig 4). Fig 4
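
The ordered list of fragment lengths produced by such a digest is what an in silico restriction map looks like. A minimal sketch, assuming a single-stranded view of the sequence and a made-up toy sequence; the recognition site CTCGAG with a cut after the first base is that of XhoI, the enzyme named later in the E. coli figure:

```python
def digest(sequence, site, cut_offset):
    """Cut `sequence` at every occurrence of `site`, `cut_offset` bases into
    the site, and return the ordered list of fragment lengths."""
    cuts = []
    pos = sequence.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = sequence.find(site, pos + 1)
    edges = [0] + cuts + [len(sequence)]
    return [end - start for start, end in zip(edges, edges[1:])]

# Toy sequence with two recognition sites.
toy = "AAAA" + "CTCGAG" + "TTTTT" + "CTCGAG" + "GG"
fragments = digest(toy, site="CTCGAG", cut_offset=1)
```

In the optical setting the fragment lengths are not read from sequence but estimated from the images, so each measured length carries sizing error that the map assembly has to model.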

67 Human Cancer Genome Project S*M*A*S*H The DNA is then stained with a fluorogen (Fig 5) and reimaged. The two images are combined to create a composite image suggesting the locations of a specific short word (e.g., probes) within the context of a pattern of restriction sites. Fig 5

68 Human Cancer Genome Project S*M*A*S*H –The intensity of the light emitted by the dye at one frequency provides a measure of the length of the DNA fragments. –The intensity of the light emitted by the bright-emitters on probes provides an intensity profile for locations of the probes. Images of each DNA molecule are then converted into ideograms, where the restriction sites are represented by a tall rectangle and probe sites by small circles (Fig 6). Fig 6

69 Human Cancer Genome Project S*M*A*S*H The steps above are repeated for all possible probe compositions (modulo reverse complementarity). Sutta software then uses the data from all such individual ideograms to create an assembly of the haplotypic ordered restriction maps with approximate probe locations superimposed on the map. ATAT TATC ATCA TCAT CATA ATATCATAT Fig 7

70 Human Cancer Genome Project S*M*A*S*H Local clusters of overlapping words are combined by Sutta’s PSBH (positional sequencing by hybridization) algorithm to overlay the inferred haplotypic sequence on top of the restriction map (Fig 7). ATAT TATC ATCA TCAT CATA ATATCATAT Fig 7
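
The word overlay in Fig 7 can be mimicked with a toy assembler. This sketch is not Sutta's PSBH algorithm: it assumes, unrealistically, that each word's position is known well enough to order the words, and then simply chains consecutive words that overlap by k-1 bases. The function name and the positions are invented for the example.

```python
def assemble_positional(hits):
    """hits: list of (approximate_position, word), all words of equal length k.
    Sort by position, then extend the sequence one base per word, checking
    that each word overlaps the sequence so far by k - 1 bases."""
    ordered = [word for _, word in sorted(hits)]
    sequence = ordered[0]
    for word in ordered[1:]:
        if sequence[-(len(word) - 1):] != word[:-1]:
            raise ValueError(f"word {word} does not overlap the sequence so far")
        sequence += word[-1]
    return sequence

# The words from Fig 7, given in scrambled order with hypothetical positions.
hits = [(3, "TCAT"), (0, "ATAT"), (5, "ATAT"), (2, "ATCA"), (4, "CATA"), (1, "TATC")]
assembled = assemble_positional(hits)
```

Note that the word ATAT occurs twice in this example. Without positional information a repeated word makes the reconstruction ambiguous, which is the motivation for attaching locations to the words in the first place.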

71 Human Cancer Genome Project Gapped Probes Mixing ‘solid’ bases with ‘wild-card’ bases: –E.g., xx*x**x*xx (10-4-mers) or xx*x****x*xx (12-6-mers) A ‘wild-card’ base –Universal: In terms of its ability to form base pairs with the other natural DNA/RNA bases. –Applications in primers and in probes for hybridization Examples: –The naturally occurring base hypoxanthine, as its ribo- or 2'-deoxyribonucleoside –2'-deoxyisoinosine –7-deaza-2'-deoxyinosine –2-aza-2'-deoxyinosine
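
In software terms a gapped probe is a pattern with don't-care positions. A minimal sketch of where such a probe would match a target strand, with `*` standing for a universal base (the probe, the sequence, and the function name are made up, and real hybridization is of course not an exact string match):

```python
def gapped_matches(sequence, probe):
    """Return start indices where `probe` matches `sequence`;
    '*' in the probe matches any base."""
    k = len(probe)
    return [
        i
        for i in range(len(sequence) - k + 1)
        if all(p == "*" or p == s for p, s in zip(probe, sequence[i:i + k]))
    ]

# A probe following the xx*x**x*xx pattern: 10 positions, 4 of them wild-cards.
probe = "AC*G**T*CA"
target = "GGACTGAATGCATT"
positions = gapped_matches(target, probe)
```

The gapped probe spans 10 bases of the target while only 6 solid bases must pair, so it reaches across a longer stretch of sequence than an ungapped 6-mer with the same number of solid bases.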

72 Human Cancer Genome Project Simulation Results Probe Map Assumptions: –For single DNA molecules: Probe location Standard Deviation = 240 bases; Data coverage per probe map = 50x; Probe hybridization rate = 30%, and false positive rate of 10 probes per megabase, uniformly distributed. –Analytical estimation of the average error rate in the probe consensus map: Probe location SD = 60 bases; False Positive rate < 2.4%; False Negative rate < 2.0%.
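
One back-of-the-envelope reading of the consensus figure, offered as an assumption since the slides do not show the derivation: if each probe site is observed on about coverage × hybridization rate molecules and the consensus location is their average, the location error shrinks by the square root of that count.

```python
import math

def consensus_sd(single_molecule_sd, coverage, hybridization_rate):
    """Standard error of an averaged probe location, assuming independent
    observations on coverage * hybridization_rate molecules."""
    effective_observations = coverage * hybridization_rate
    return single_molecule_sd / math.sqrt(effective_observations)

# Values from the probe map assumptions: SD 240 bases, 50x coverage, 30% hybridization.
sd = consensus_sd(240.0, 50, 0.30)  # 240 / sqrt(15), about 62 bases
```

That lands close to the 60-base consensus SD quoted above, though the agreement does not show that this is how the quoted figure was obtained.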

73 Human Cancer Genome Project Simulation Results [Figure panels: UNGAPPED; GAPPED]

74 Human Cancer Genome Project Simulation Results Simulation based on non-random sequences from the human genome: 96 blocks of 1 Kb (from chromosome 1) concatenated together along with its in silico restriction map. –Error summary for the gapped probe pattern xx*x **** x*xx: Error count excluding repeats or near repeats: 0.32bp / 10Kb –There is no error due to incorrect rearrangements. –There is no loss of information at haplotypic level. –Assembly failed in 2 of 96 blocks of 1kb = 2.1% failure rate (out of memory).

75 Human Cancer Genome Project GENomic conTIG Gentig uses a purely Bayesian Approach. –It models all the error processes in the prior. –FAST: It initially starts with a conservative but fast pairwise overlap configuration, computed efficiently using Geometric Hashing. –ACCURATE: It iteratively combines pairs of maps or map contigs, while optimizing the likelihood score subject to a false-positive constraint. –It has special heuristics to handle non-local errors.

76 Human Cancer Genome Project HAPTIG: HAPlotypic conTIG Candida albicans The left end of chromosome 1 of the common fungus Candida albicans (being sequenced by Stanford). You can clearly see 3 polymorphisms: –(A) Fragment 2 is of size 41.19kb (top) vs 38.73kb (bottom). –(B) The 3rd fragment of size 7.76kb is missing from the top haplotype. –(C) The large fragment in the middle is of size 61.78kb vs 59.66kb. FAST & ACCURATE BAYESIAN ALGORITHM

77 Human Cancer Genome Project Lambda DNA with probes [Figure; scale bar: 10 μm]

78 Human Cancer Genome Project Fig. A: Four AFM images of lambda DNA with PNA probes hybridized to the distal recognition site, located 6,900 bp or 2.28 microns from the end (green arrow). Non-specifically bound probes are indicated by the red arrows. Z-scale is +/- 1.5 nm. [Scale bar: 500 nm]

79 Human Cancer Genome Project E. coli Figure 3. Two optical images of E. coli K12 genomic DNA after restriction digestion with the 6-cutter restriction enzyme XhoI and hybridization with an 8-mer PNA probe. Bound probes are indicated by blue arrows and non-specifically bound probes by the red arrows. The scale bar shown is 10 microns.

80 Human Cancer Genome Project Discussions Q&A…

81 Human Cancer Genome Project Answer to Cancer “If I know the answer I'll tell you the answer, and if I don't, I'll just respond, cleverly.” –US Secretary of Defense, Mr. Donald Rumsfeld.

82 Human Cancer Genome Project To be continued… Break…

