Presentation is loading. Please wait.

Presentation is loading. Please wait.

Computational Systems Biology of Cancer:

Similar presentations

Presentation on theme: "Computational Systems Biology of Cancer:"— Presentation transcript:

1 Computational Systems Biology of Cancer:
(II) Computational Systems Biology of Cancer:

2 Professor of Computer Science, Mathematics and Cell Biology
Bud Mishra Professor of Computer Science, Mathematics and Cell Biology Courant Institute, NYU School of Medicine, Tata Institute of Fundamental Research, and Mt. Sinai School of Medicine

3 The New Synthesis DNA RNA Protein Genome Evolution Selection
Part-lists, Annotation, Ontologies DNA RNA Protein Transcription Translation Genotype Phenotype Genome Evolution Selection micro-environment epigenomics transcriptomics proteomic metabolomics signaling perturbed pathways genetic instability

4 Is the Genomic View of Cancer Necessarily Accurate ?
“If I said yes, that would then suggest that that might be the only place where it might be done which would not be accurate, necessarily accurate. It might also not be inaccurate, but I'm disinclined to mislead anyone.” US Secretary of Defense, Mr. Donald Rumsfeld, Once again quoted completely out of context.

5 Cancer Initiation and Progression
Genomics (Mutations, Translocations, Amplifications, Deletions) Epigenomics (Hyper & Hypo-Methylation) Transcriptomics (Alternate Splicing, mRNA) Proteomics (Synthesis, Post-Translational Modification, Degradation) Signaling Cancer Initiation and Progression Proliferation, Motility, Immortality, Metastasis, Signaling

6 Mishra’s Mystical 3M’s Rapid and accurate solutions
Bioinformatic, statistical, systems, and computational approaches. Approaches that are scalable, agnostic to technologies, and widely applicable Promises, challenges and obstacles— Measure Mine Model

7 What we can quantify and what we cannot
“Measure” What we can quantify and what we cannot

8 Microarray Analysis of Cancer Genome
Normal DNA Normal LCR Tumor DNA Tumor LCR Label Hybridize Representations are reproducible samplings of DNA populations in which the resulting DNA has a new format and reduced complexity. We array probes derived from low complexity representations of the normal genome We measure differences in gene copy number between samples ratiometrically Since representations have a lower nucleotide complexity than total genomic DNA, we obtain a stronger specific hybridization signal relative to non-specific and noise

9 Minimizing Cross Hybridization (Complexity Reduction)

10 Copy Number Fluctuation

11 Critical Innovations Data Normalization and Background Correction for Affy-Chips 10K, 100K, 500K (Affy); Generalized RMA Multi-Experiment-Based Probe-Characterization (Kalman + EM) A novel genome segmenter algorithm Empirical Bayes Approach; Maximum A Posteriori (MAP) Generative Model (Hierarchical, Heteroskedastic) Dynamic Programming Solution Cubic-Time; Linear-time Approximation using Beam-Search Heuristic Single Molecule Technologies Optical and Nanotechnologies Sequencing: SMASH Epigenomics Transcriptomics

12 Background Correction & Normalization

13 Oligo Arrays: SNP genotyping
Given 500K human SNPs to be measured, select mers that over lap each SNP location for Allele A. Select another mers corresponding to SNP Allele B. Problem : Cross Hybridization DNA 25-mers

14 Using SNP arrays to detect Genomic Aberrations
Each SNP “probeset” measures absense/presence of one of two Alleles. If a region of DNA is deleted by cancer, one or both alleles will be missing! If a region of DNA is duplicated/amplified by cancer, one or both alleles will be amplified. Problem : Oligo arrays are noisy.

15 90 humans, 1 SNP (A=0.48) Allele B Allele A

16 90 humans, 1 SNP (A=0.24) Allele B Allele A

17 90 humans, 1 SNP (A=0.96) Allele B Allele A

18 Background Correction & Normalization
Consider a genomic location L and two “similar” nucleotide sequences sL,x and sL,y starting at that location in the two copies of a diploid genomes… E.g., they may differ in one SNP. Let qx and qy be their respective copy numbers in the whole genome and all copies are selected in the reduced complexity representation. The gene chip contains four probes px 2 sL,x; py 2 sL,y; px’, py’ :2 G. After PCR amplification, we have some Kx ¢ qx amount of DNA that is complementary to the probe px, etc.K' (¼ K’x) amount of DNA that is additionally approximately complementary to the probe px.

19 Normalize using a Generalized RMA
I’ = U - mn – [a sn2 - fN(0,1)(a’/b’)/FN(0,1)(a’/b’)] £{(1 + b’ Bsn/FN(0,1)(a’/b’)}-1 + [bsn/Bsn] )] £{(1 + FN(0,1)(a’/b’)/(b’ Bsn)}-1, Where a’ = U-mn -a sn2; b’ = sn, and bsn = å [Ii,j – U + mn] fN(0,1)([Ii,j – U + mn] ) Bsn = å fN(0,1)([Ii,j – U + mn] )

20 Background Correction & Normalization
If the probe has an affinity fx, then the measured intensity is can be expressed as [Kx qx + K’] fx +noise = [qx + K’/Kx] f’x + noise With Exp[m1 + e s1], a multiplicative logNormal noise, [m2 + e s2] an additive Gaussian noise, and f’x = Kx fx an amplified affinity. A more general model: Ix = [qx + K’/Kx] f’x em1+e s1 + m2 + e s2

21 Mathematical Model Ix = [qx f’x + Nx]e m1 +e s1 +m2 + e s2
In particular, we have four values of measured intensities: Ix = [qx f’x + Nx]e m1 +e s1 +m2 + e s2 Ix’ = [Nx] e m1 +e s1 +m2 + e s2 Iy = [qy f’y + Ny] e m1 +e s1 +m2 + e s2 Iy’ = [Ny] e m1 +e s1 +m2 + e s2

22 Bioinformatics: Data modeling
Good news: For each 25-bp probe, the fluorescent signal increases linearly with the amount of complementary DNA in the sample (up to some limit where it saturates). Bad news: The linear scaling and offset differ for each 25-bp probe. Scaling varies by factors of more than 10x. Noise : Due to PCR & cross hybridization and measurement noise.

23 Scaling & Offset differ
Scaling varies across probes: Each 25-bp sequence has different thermodynamic properties. Scaling varies across samples: The scanning laser for different samples may have different levels. The starting DNA concentrations may differ; PCR may amplify differently. Offset varies across probes: Different levels of Cross Hybridization with the rest of the Genome. Offset varies across samples: Different sample genomes may differ slightly (sample degradation; impurities, etc.)

24 Linear Model + Noise

25 Noise minimization

26 Final Data Model

27 MLE using gradients

28 Data Outliers Our data model fails for few data points (“bad probes”)
Soln (1): Improve the model… Soln (2): Discard the outliers Soln (3): Alternate model for the outliers… Weight the data approprately.

29 Outlier Model

30 Problem with MLE: No unique maxima

31 Scaling of MLE estimate

32 Segmentation to reduce noise
The true copy number (Allele A+B) is normally 2 and does not vary across the genome, except at a few locations (breakpoints). Segmentation can be used to estimate the location of breakpoints and then we can average all estimated copy number values between each pair of breakpoints to reduce noise.

33 Allelic Frequencies: Cancer & Normal

34 Allelic Frequencies: Cancer & Normal

35 Segmentation & Break-Point Detection

36 Algorithmic Approaches
Local Approach Change-point Detection (QSum, KS-Test, Permutation Test) Global Approach HMM models Wavelet Decomposition Bayesian & Empirical Bayes Approach Generative Models (One- or Multi-level Hierarchical) Maximum A Posteriori

37 2 3 4 5 6 1 HMM Model with a very high degree of freedom, but not enough data points. Small Sample statistics a Overfitting, Convergence to local maxima, etc.

38 HMM, finally… Model with a very high degree of freedom, but not enough data points. Small Sample statistics a Overfitting, Convergence to local maxima, etc. ¸ 3 · 1 2

39 HMM, last time ¹ 2 1-pe pe =2 pb 1-pb
Advantages: Small Number of parameters. Can be optimized by MAP estimator. (EM has difficulties). Easy to model deviation from Markvian properties (e.g., polymorphisms, power-law, Polya’s urn like process, local properties of chromosomes, etc.) We will simply model the number of break-points by a Poisson process, and lengths of the aberrational segments by an exponential process. Two parameter model: pb & pe ¹ 2 1-pe pe =2 pb 1-pb

40 Generative Model Breakpoints, Poisson, pb
Segmental Length, Exponential, pe Copy number, Empirical Distribution Noise, Gaussian, m, s Amplification, c=4 Amplification, c=3 Deletion, c=0 Deletion, c=1

41 A reasonable choice of priors yields good segmentation.

42 A reasonable choice of priors yields good segmentation.

43 A MAP (Maximum A Posteriori) Estimators
Priors: Deletion + Amplification Data: Priors + Noise Goal: Find the most plausible hypothesis of regional changes and their associated copy numbers Generalizes HMM:The prior depends on two parameters pe and pb. pe is the probability of a particular probe being “normal”. pb is the average number of intervals per unit length. (pe,pb) max at (0.55,0.01)

44 £ (2 p s2)(-n/2)Õi=1n Exp[-(vi - mj)2/2s2] £ pe(#global)(1-pe)(#local)
Likelihood Function The likelihood function for first n probes: L(h i1, m1, …, ik, mk i) = Exp(-pb n) (pb n)k £ (2 p s2)(-n/2)Õi=1n Exp[-(vi - mj)2/2s2] £ pe(#global)(1-pe)(#local) Where ik = n and i belongs to the jth interval. Maximum A Posteriori algorithm (implemented as a Dynamic Programming Solution) optimizes L to get the best segmentation L(h i*1, m*1, …, i*k, m*k i)

45 Dynamic Programming Algorithm
Generalizes Viterbi and Extends. Uses the optimal parameters for the generative model: Adds a new interval to the end: h i1, m1, …, ik, mk i ± h ik+1, mk+1 i = h i1, m1, …, ik, mk, ik+1, mk+1 i Incremental computation of the likelihood function: – Log L(h i1, m1, …, ik, mk, ik+1, mk+1 i) = –Log L(h i1, m1, …, ik, mki) + new-res./2s2 – Log(pbn) +(ik+1 – ik) Log (2ps2) – (ik+1 – ik) [Iglobal Log pe + Ilocal Log(1 – pe)]

46 Prior Selection: F criterion
For each break we have a T2 statistic and the appropriate tail probability (p value) calculated from the distribution of the statistic. In this case, this is an F distribution. The best (pe,pb) is the one that leads to the maximum min p-value. (pe,pb) max at (0.55,0.01)

47 Segmentation Analysis

48 Comparison of chromosome 13 tumor using 4 different segmentation algorithm
vMAP Olshen, AB et al. Biostatistics 5: Lingjaerde, OC et al. Bioinformatics 21: 821-2 Hupe, P et al. Bioinformatics 20: Daruwala et al. Proc Natl Acad Sci U S A. 2004 DNAcopy CGH Explorer v.2.43 GLAD 13q13.1 13q31.3

49 Comparative Analysis: BAC Array

50 Comparative Analysis: Nimblegen

51 Comparative Analysis: Affy 10K

52 Simulated Data Array CGH simulations and an “ROC analysis”
Using the same scheme as Lai et al. Weil R. Lai, Mark D. Johnson, Raju Kucherlapati, and Peter J. Park (2005), “Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data,” Bioinformatics, 21(19): Segmented by Vmap and DNAcopy Vmap algorithm was tested at 11 segmentation Pvalues of: 0.1, , 10-2, 10-3, 10-4, …, DNAcopy algorithm was tested at 9 segmentation alpha values of: .9, .5, .1, 10-2, 10-3, 10-4, …, 10-7. Analysis by Alex Pearlman et al. (2006)


54 DNACopy


56 Prostate Tumor Gains and Losses Genome view of 19K BAC CGH
Log ratio

57 Segmentation of Multi-BAC Events On Chromosome 13
Proximal breakpoints were identical for T1 and T3. Distal breakpoints overlapped for T1, T2, and T3. Normal 1,2,3 Tumor1 Tumor2 Tumor3

58 Further Improvement We employed a hierarchical Bayesian model in which global false discovery rates can be calculated using the different levels of the model. Noise processes are also estimated using the appropriate global parameters.

59 Specific Features of the Model
We build a model in which, given the region segmentations, we assume that the copy numbers Ij = region j, (1 · j · k) in that regions are mutually independent Gaussian Xi,j» N(qj, sj2), (1 · i · nj) random variables with mean qj and variance sj2. We further assume that each copy region mean parameter qj is in one of a small number of ‘states’ 2 {1,…,S} with respective probabilities, p1, …, pS of being in state s. qj is in state s (with probability ps) if it has a Gaussian distribution with state mean qs and state variance ts2 . States serve to characterize regions. The state means and variances are the hyperparameters of the model.

60 Implementation: Dynamic Programming
Given the hyperparameters, we segment regions using a dynamic programming approach. This consists in constructing probe regions as follows: After the (j-1)st region has been constructed: A) we choose the next two contiguous regions to the right of those already constructed by optimizing the corresponding log likelihood, subject to the condition that the p-value of the t-statistic distinguishing between these two (aforementioned) regions is above a given threshold. B) Having chosen these (aforementioned) regions, the probe regions already constructed, contiguous to them, may also need to be altered.

61 Segmentation (ROMA,chr3)
Red = New EM algorithm (where it differs) Green = Basic heteroscedastic model

62 S*M*A*S*H ~Extensions to Optical Mapping~
Single Molecule Approaches to Sequencing by Hybridization ~Extensions to Optical Mapping~

63 S*M*A*S*H Genomic DNA is carefully extracted from small number of cells of an organism (e.g., human) in normal or diseased states. (Fig 1 shows a cancer cell to be studied for its oncogeneomic characterization.) Fig 1

64 S*M*A*S*H LNA probes of length 6 – 8 nucleotides are hybridized to dsDNA (double-stranded genomic DNA) in a test tube (Fig 2) and the modified DNA is stretched on a 1” x 1” chip that has microfluidic channels manufactured on its surface. These surfaces have been chemically treated to create a positive charge. Fig 2 DNA samples are prepared for analysis with LNA probes and restriction enzymes.

65 S*M*A*S*H Since DNA is slightly negatively charged, it adheres to the surface as it flows along these channels and stretches out. Individual molecules range in size from 0.3 – 3 million base pairs in length. Next, bright emitters are attached to the probes on the surface and the molecules are imaged (Fig 3). Fig 3

66 S*M*A*S*H A restriction enzyme1 is added to break the DNA at specific sites. Since DNA molecules are under slight tension, the cut fragments of DNA relax like entropic springs, leaving small visible gaps corresponding to the positions of the restriction site (Fig 4). 1. A restriction enzyme is a highly specific molecular scissor that recognizes short nucleotide sequences and cuts the DNA at only those recognition sites. Fig 4

67 S*M*A*S*H The DNA is then stained with a fluorogen (Fig 5) and reimaged. The two images are combined to create a composite image suggesting the locations of a specific short word (e.g., probes) within the context of a pattern of restriction sites. Fig 5

68 S*M*A*S*H The intensity of the light emitted by the dye at one frequency provides a measure of the length of the DNA fragments. The intensity of the light emitted by the bright-emitters on probes provides an intensity profile for locations of the probes. Images of each DNA molecule are then converted into ideograms, where the restriction sites are represented by a tall rectangle and probe sites by small circles (Fig 6). Fig 6

69 S*M*A*S*H The steps above are repeated for all possible probe compositions (modulo reverse complementarity). Sutta software then uses the data from all such individual ideograms to create an assembly of the haplotypic ordered restriction maps with approximate probe locations superimposed on the map. ATAT TATC ATCA TCAT CATA ATATCATAT Fig 7

70 S*M*A*S*H Local clusters of overlapping words are combined by Sutta’s PSBH (positional sequencing by hybridization) algorithm to overlay the inferred haplotypic sequence on top of the restriction map (Fig 7). ATAT TATC ATCA TCAT CATA ATATCATAT Fig 7

71 Gapped Probes Mixing ‘solid’ bases with `wild-card’ bases:
E.g., xx*x**x*xx (10-4-mers) or xx*x****x*xx (12-6-mers) An ‘wild-card’ base Universal: In terms of its ability to form base pairs with the other natural DNA/RNA bases. Applications in primers and in probes for hybridization Examples: The naturally occurring base hypoxanthine, as its ribo- or 2'-deoxyribonucleoside 2'-deoxyisoinosine 7-deaza-2'-deoxyinosine 2-aza-2'-deoxyinosine

72 Simulation Results Probe Map Assumptions: For single DNA molecules:
Probe location Standard Deviation = 240 bases; Data coverage per probe map = 50x; Probe hybridization rate = 30%, and false positive rate of 10 probes per megabase, uniformly distributed. Analytically estimation of the average error rate in the probe consensus map: Probe location SD = 60 bases; False Positive rate < 2.4%; False Negative rate < 2.0%.

73 Simulation Results UNGAPPED GAPPED

74 Simulation Results Simulation based on non-random sequences from the human genome: 96 blocks of 1 Kb (from chromosome 1) concatenated together along with its in silico restriction map. Error summary for the gapped probe pattern xx*x **** x*xx: Error count excluding repeats or near repeats: 0.32bp / 10Kb There is no error due to incorrect rearrangements. There is no loss of information at haplotypic level. Assembly failed in 2 of 96 blocks of 1kb = 2.1% failure rate (out of memory).

75 GENomic conTIG Gentig uses a purely Bayesian Approach.
It models all the error processes in the prior. FAST: It initially starts with a conservative but fast pairwise overlap configuration, computed efficiently using Geometric Hashing. ACCURATE: It iteratively combines pairs of maps or map contigs, while optimizing the likelihood score subject to a constraint imposed by a false-positive constraint. It has special heuristics to handle non-local errors.

76 HAPTIG: HAPlotypic conTIG Candida Albicans
FAST & ACCURATE BAYESIAN ALGORITHM The left end of chromsome-1 of the common fungus Candida Albicans (being sequenced by Stanford). You can clearly see 3 polymorphisms: (A) Fragment 2 is of size 41.19kb (top) vs 38.73kb (bottom). (B) The 3rd fragment of size 7.76kb is missing from the top haplotype. (C)The large fragment in the middle is of size 61.78kb vs 59.66kb.

77 Lambda DNA with probes 10 mm

78 A 500 nm Fig. A : Four AFM images of lambda DNA with PNA probes hybridized to the distal recognition site, located 6,900 bp or 2.28 microns from the end (green arrow). Non-specifically bound probes indicated by the red arrows. Z-scale is +/- 1.5 nm.

79 E. coli Figure 3. Two optical images of E coli K12 genomic DNA after restriction digestion with 6-cutter restriction enzyme Xho 1 and hybridization with an 8-mer PNA probe. Bound probes are indicated by blue arrows and non-specifically bound probes by the red arrows. Scale bar shown is 10 micron.

80 Discussions Q&A…

81 Answer to Cancer “If I know the answer I'll tell you the answer, and if I don't, I'll just respond, cleverly.” US Secretary of Defense, Mr. Donald Rumsfeld.

82 To be continued… Break…

Download ppt "Computational Systems Biology of Cancer:"

Similar presentations

Ads by Google