CSCI2950-C Lecture 10 Cancer Genomics: Duplications

October 21, 2008

Outline
Cancer Genomes: Duplications and End Sequence Profiling; Comparative Genomic Hybridization
Cancer Progression Models

End Sequence Profiling (ESP) C. Collins and S. Volik (2003)
Pieces of cancer genome: clones ( kb). Sequence the ends of each clone (500 bp) and map the end sequences to the human genome. Because of the end sequencing protocol, clones have direction. Each clone corresponds to a pair of end sequences (ES pair) (x, y). Retain clones that correspond to a unique ES pair.
[Figure: a clone from cancer DNA whose end sequences map to positions x and y on human DNA.]

Valid ES pairs: L_min ≤ y − x ≤ L_max, where L_min (L_max) is the minimum (maximum) clone size, and the two ends map in convergent orientation.
[Figure: a clone of length L from cancer DNA whose end sequences map to positions x and y on human DNA.]

Some pairs cannot be mapped due to repeats in the human genome. Invalid ES pairs indicate a putative rearrangement in cancer. The ES directions point toward the breakpoints (a, b): L_min ≤ |x − a| + |y − b| ≤ L_max.
[Figure: a clone whose ends map to distant positions x and y on human DNA, with breakpoints a and b.]
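The valid/invalid distinction can be sketched as a small classifier. This is only an illustration: the `ESPair` record, the `+`/`-` strand encoding, and the toy thresholds in the usage note are assumptions, not part of the ESP protocol described in the lecture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ESPair:
    """One mapped clone: chromosome, coordinate and strand of each end (hypothetical layout)."""
    chrom_x: str
    x: int
    strand_x: str  # "+" or "-"
    chrom_y: str
    y: int
    strand_y: str


def is_valid(pair: ESPair, l_min: int, l_max: int) -> bool:
    """Valid means: same chromosome, convergent ends, and L_min <= y - x <= L_max."""
    if pair.chrom_x != pair.chrom_y:
        return False
    # Order the two ends by genome coordinate.
    (lo, lo_strand), (hi, hi_strand) = sorted(
        [(pair.x, pair.strand_x), (pair.y, pair.strand_y)]
    )
    # Convergent: the left end points right (+) and the right end points left (-).
    convergent = lo_strand == "+" and hi_strand == "-"
    return convergent and l_min <= hi - lo <= l_max
```

With toy thresholds `l_min=100` and `l_max=300`, a pair at (1000, +) and (1200, −) on the same chromosome is valid; flipping the strands, stretching the distance, or moving one end to another chromosome makes it invalid, i.e. a candidate rearrangement or an experimental error.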

ESP of Normal Cell
All ES pairs are valid: L_min ≤ y − x ≤ L_max. It is sometimes useful to have a 2D representation: each point (x, y) is an ES pair, with genome coordinate on both axes. In a normal cell all ES pairs lie near the diagonal.

ESP of Tumor Cell
Valid ES pairs satisfy the length/direction constraints: L_min ≤ y − x ≤ L_max. Invalid ES pairs indicate rearrangements or experimental errors. First show how to deal with noise…

What about duplications?
Clusters were biased to a very small region of the genome. These regions are duplicated, and can be measured as such independently. Take a duplication-centric view.
11240 ES pairs: 10453 valid (black), 737 invalid. Of the invalid pairs, 489 are isolated (red) and 248 form 70 clusters (blue). 33/70 clusters; total length: 31 Mb.

Structure of Duplications in Tumors?
Duplicated segments may co-localize (Guan et al., Nat. Gen. 1994). Mechanisms are not well understood.
[Figure: human genome vs. tumor genome.]

Structure of Duplications: Approaches
Model free: assemble the tumor genome. Model based: use knowledge of duplication mechanisms. We could just sequence the genome, but we didn't have enough data, so we used a model-based approach based on known biology.
[Figure: human genome vs. tumor genome.]

Duplication by Amplisome
(Maurer et al. 1987; Wahl 1989; …) Other terms: episome, amplicon, double-minute, extrachromosomal amplification. Gives a single model of duplication.

Amplisome Reconstruction Problem
Assume: the tumor genome sequence is known; insertions are independent, i.e. no insertions within insertions. Approach: identify duplicated sequences A1, …, Am. The amplisome is the shortest common superstring of A1, …, Am. In practice we don't know the genome sequence. Can we do something similar?
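As a rough illustration of the superstring formulation, here is a greedy overlap-merging sketch. It is a standard heuristic, not the lecture's method: the shortest common superstring problem is NP-hard, so greedy merging only approximates the optimum. The function names and the single-letter segment encoding are assumptions made for the example.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings):
    """Greedy approximation: repeatedly merge the pair with the largest overlap."""
    unique = set(strings)
    # Strings contained in another string add nothing to the superstring.
    segs = sorted(s for s in unique if not any(s != t and s in t for t in unique))
    while len(segs) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(segs):
            for j, b in enumerate(segs):
                if i != j and overlap(a, b) > best_k:
                    best_k, best_i, best_j = overlap(a, b), i, j
        merged = segs[best_i] + segs[best_j][best_k:]
        segs = [s for idx, s in enumerate(segs) if idx not in (best_i, best_j)]
        segs.append(merged)
    return segs[0] if segs else ""
```

For duplicated sequences written over segment labels, such as `["ABC", "BCD", "CDE"]`, the merged result contains every input as a substring.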

Analyzing Duplications
[Figure: human vs. tumor genome with a duplication; segments labeled u, A, B, w, C, D, v, E, with ES pairs drawn between them.]
Signature of a duplication. Complicated duplication: overlapping duplication and rearrangement. An additional ES pair resolves the duplication.

Duples and Boundary Elements
Call this configuration a duple with boundary elements v and w.
[Figure: human vs. tumor genome with a duplication; segments labeled u, A, B, w, C, D, v, E.]

Duplications in ESP graph
Boundary elements v, w are vertices in the ESP graph. A path between the boundary elements resolves the duple.
[Figure: ESP graph over segments A to E with vertices u, v, w and a duple between v and w.]

Duplication Complications
These configurations are frequent in the MCF7 data.
[Figure: unresolved configurations around segments B, C, E with elements u, v, w.]

Resolving Duplication as Paths
A path between boundary elements resolves the duple.
[Figure: ESP graph over segments A to E with boundary elements v, w and vertex u.]

Resolving Duplications as Paths
There may be multiple paths between duple boundary elements.
[Figure: ESP graph over segments A, B, C, E with boundary elements v, w and vertex u.]
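The idea of resolving a duple by a path, and the fact that several paths may exist, can be sketched with a plain path enumeration. The adjacency-dict encoding and the toy graph in the usage note are assumptions for illustration; the lecture does not spell out how the ESP graph is stored.

```python
def simple_paths(graph, start, goal, path=None):
    """Yield every simple path (no repeated vertex) from start to goal.

    graph is a dict mapping a vertex to a list of its successors.
    """
    path = (path or []) + [start]
    if start == goal:
        yield path
        return
    for nxt in graph.get(start, ()):
        if nxt not in path:  # do not revisit vertices already on this path
            yield from simple_paths(graph, nxt, goal, path)
```

On a toy graph `{"v": ["C", "u"], "C": ["w"], "u": ["w"], "w": []}`, `list(simple_paths(graph, "v", "w"))` returns two candidate resolutions of the duple, one through `C` and one through `u`. Enumeration is exponential in general, so this is only suitable for small neighborhoods.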

ESP Amplisome Reconstruction Problem
Assume: insertions are independent, i.e. no insertions within insertions. Approach: identify the endpoints of duplications (v1, w1), …, (vm, wm). The amplisome is the shortest common superpath in the ESP graph containing the subpaths v1…w1, v2…w2, …, vm…wm.
[Figure: segments u, A, B, w, C, v, E.]

Reconstructed MCF7 amplisome
Chromosomes 1, 3, 17, 20. 33 clusters; total length: 31 Mb. The amplisome model explains 24/33 invalid clusters. Raphael and Pevzner (2004) Bioinformatics.
Would like to tell you that we've done this many times, but there just isn't the data yet; still waiting on new sequencing technologies. Noticed similarities with a problem in evolution.

DNA Basepairing

DNA Microarrays

Comparative Genomic Hybridization (CGH)
Measuring mutations in cancer.

CGH Analysis (1)
Divide the genome into segments of equal copy number. Copy number of adjacent probes is correlated.
[Figure: Log2(R/G), between −0.5 and 0.5, against genomic position, with a deletion and an amplification marked.]

CGH Analysis (2)
Identify aberrations common to multiple samples. Chromosome 3 of 26 lung tumor samples on a mid-density cDNA array: a common deletion is located in 3p21 and a common amplification in 3q.

CGH Analysis (1)
Divide the genome into segments of equal copy number (segmentation). Copy number of adjacent probes is correlated.
Input: y_i = log2(T_i / R_i), clone i = 1, …, N.
Output: assignment s(y_i) ∈ {S1, …, SK}, where the S_i represent copy number states.
Numerous methods (e.g. clustering, Hidden Markov Model, Bayesian, etc.).
[Figure: copy number profile, copy number against genome coordinate.]

An Approach to CGH Segmentation
Circular Binary Segmentation (CBS), Olshen et al. 2004. Use a hypothesis test (t-test) to compare the means of two intervals.
[Figure: Log2 ratio, between −0.5 and 0.5, against genomic position, with a deletion and an amplification marked.]
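The core comparison, a t-test between the means of two intervals, can be sketched with a single change-point search. This is a deliberately simplified stand-in and not CBS itself: it tests only one split of a linear segment, uses a pooled-variance t statistic, and leaves out the circular arcs and the significance assessment of the published method. The function names and `min_size` parameter are assumptions.

```python
import math
from statistics import mean, variance


def t_statistic(left, right):
    """Two-sample t statistic with pooled variance (each side needs >= 2 values)."""
    n1, n2 = len(left), len(right)
    pooled = ((n1 - 1) * variance(left) + (n2 - 1) * variance(right)) / (n1 + n2 - 2)
    diff = mean(right) - mean(left)
    if pooled == 0:
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / math.sqrt(pooled * (1 / n1 + 1 / n2))


def best_split(y, min_size=2):
    """Return (index, |t|) of the split y[:index] / y[index:] with the largest |t|."""
    best_t, best_i = 0.0, None
    for i in range(min_size, len(y) - min_size + 1):
        t = abs(t_statistic(y[:i], y[i:]))
        if t > best_t:
            best_t, best_i = t, i
    return best_i, best_t
```

Applied recursively to each side of an accepted split, this gives a plain binary segmentation of the log-ratio profile; deciding whether a split is significant still needs a reference distribution for the maximal statistic.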

Interval Score
Lipson et al., J. Computational Biology, 2006. Assume: X_i are independent, normally distributed; µ and σ denote the mean and standard deviation of the normal genomic data. Given an interval I spanning k probes, we define its score as:
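The score equation itself did not survive in this transcript. A plausible reconstruction, assuming the score is the sum of deviations from µ normalized by σ√k (the form suggested by the sum / length / score bookkeeping on the LookAhead slide), is:

```latex
S(I) = \sum_{i \in I} \frac{X_i - \mu}{\sigma \sqrt{k}}, \qquad k = |I|
```

Under this assumed form and the stated independence and normality assumptions, S(I) for a fixed interval I of normal data follows a standard normal distribution, so a large positive score points to an amplification.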

Significance of Interval Score
Lipson et al., J. Computational Biology, 2006. Assume: X_i ~ N(µ, σ).

The MaxInterval Problem
Input: a vector X = (X1, …, Xn). Output: an interval I ⊆ [1…n] that maximizes S(I). Other intervals with high scores may be found by recursively calling this function. Exhaustive algorithm: O(n²).
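A minimal sketch of the exhaustive O(n²) search follows. The score form, sum of (X_i − µ) divided by σ√k, is an assumption, since the slide's own equation is missing from this transcript; the function names and the 0-based inclusive interval convention are also choices made for the example.

```python
import math


def interval_score(total, k, mu=0.0, sigma=1.0):
    """Assumed score: (sum of the k values minus k*mu) / (sigma * sqrt(k))."""
    return (total - k * mu) / (sigma * math.sqrt(k))


def max_interval_exhaustive(x, mu=0.0, sigma=1.0):
    """Try every interval [i, j] (0-based, inclusive); return (best score, (i, j))."""
    best_score, best_iv = -math.inf, None
    for i in range(len(x)):
        total = 0.0
        for j in range(i, len(x)):
            total += x[j]  # running sum keeps each interval O(1)
            score = interval_score(total, j - i + 1, mu, sigma)
            if score > best_score:
                best_score, best_iv = score, (i, j)
    return best_score, best_iv
```

For `x = [-1, 2, 2, 2, -1]` the best interval is the run of three 2s, positions 1 to 3, since extending it in either direction lowers the sum faster than the √k normalization can compensate.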

MaxInterval Algorithm I: LookAhead
Assume given: m, an upper bound for the value of a single element X_i; t, a lower bound on the maximum score.
If we are currently considering an interval I = [i, …, i+k−1] with sum s = Σ_{j∈I} X_j, then the extended interval I' = [i, …, i+k+r−1] has length k + r and sum at most s + mr, which bounds its score. Solve for the first r for which S(I') may exceed t.
Complexity: expected O(n^1.5) (unproved).
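The skipping idea can be sketched as follows. Several details are assumptions rather than the published algorithm: the data are taken to be already normalized (µ = 0, σ = 1) so that the score is s/√k, the running best score plays the role of t, m is taken as max(x), and the first useful r is found by binary search on the bound (s + mr)/√(k + r) instead of a closed-form solution. The bound is nondecreasing in r when m ≥ 0, which is what makes the binary search valid.

```python
import math


def _bound(s, k, r, m):
    """Upper bound on the score after extending by r elements, each at most m."""
    return (s + m * r) / math.sqrt(k + r)


def max_interval_lookahead(x):
    """Return (best score, (i, j)) for score = sum / sqrt(length), skipping hopeless lengths."""
    n = len(x)
    if n == 0:
        raise ValueError("empty input")
    m = max(x)
    if m <= 0:
        # With no positive element, the single largest element is optimal.
        i = x.index(m)
        return float(m), (i, i)
    prefix = [0.0]
    for v in x:
        prefix.append(prefix[-1] + v)
    best_score, best_iv = -math.inf, None
    for i in range(n):
        k = 1
        while True:
            s = prefix[i + k] - prefix[i]
            score = s / math.sqrt(k)
            if score > best_score:
                best_score, best_iv = score, (i, i + k - 1)
            max_r = n - (i + k)
            # Stop if no extension of this interval can beat the current best.
            if max_r == 0 or _bound(s, k, max_r, m) <= best_score:
                break
            lo, hi = 1, max_r
            while lo < hi:  # smallest r whose bound exceeds the current best
                mid = (lo + hi) // 2
                if _bound(s, k, mid, m) > best_score:
                    hi = mid
                else:
                    lo = mid + 1
            k += lo
    return best_score, best_iv
```

The result agrees with an exhaustive search up to ties and floating-point rounding; the saving comes from the lengths that are jumped over once a good interval has raised the threshold.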

MaxInterval Algorithm II: Geometric Family Approximation (GFA)
For ε > 0, define a geometric family of intervals, with interval lengths k_j indexed by level j. Theorem: let I* be the optimal scoring interval, and let J be the leftmost longest interval of the family fully contained in I*. Then S(J) ≥ S(I*)/α, where α is a constant that depends on ε. Complexity: O(n).

Benchmarking results of the Exhaustive, LookAhead and GFA algorithms on synthetic vectors of varying lengths. Linear regression suggests that the complexities of the Exhaustive, LookAhead and GFA algorithms are O(n²), O(n^1.5) and O(n), respectively.

Applications: Single Samples
[Figure: Log2(ratio) against position (Mbp) for chromosome 16 of the HCT116 colon carcinoma cell line on a high-density oligo array (n = 5,464); FRA16B and A2BP1 marked.]
[Figure: Log2(ratio) against position (Mbp) for chromosome 17 of several breast carcinoma cell lines on a mid-density cDNA array (n = 364); ERBB2 marked.]

Another Approach to CGH Segmentation
Use a Hidden Markov Model (HMM) to “parse” the sequence of probes into copy number states.
[Figure: Log2 ratio, between −0.5 and 0.5, against genomic position, with a deletion and an amplification marked.]

Hidden Markov Models
[Figure: trellis of states 1, 2, …, K over observations x1, x2, x3, …]

Example: The Dishonest Casino
A casino has two dice. Fair die: P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6. Loaded die: P(1) = P(2) = P(3) = P(4) = P(5) = 1/10, P(6) = 1/2. The casino player switches back and forth between the fair and loaded die about once every 20 turns.
Game: you bet $1; you roll (always with a fair die); the casino player rolls (maybe with the fair die, maybe with the loaded die); the highest number wins $2.

Question # 1 – Evaluation
GIVEN: a sequence of rolls by the casino player. QUESTION: how likely is this sequence, given our model of how the casino works? This is the EVALUATION problem in HMMs. For the sequence shown on the slide: Prob = 1.3 × 10^-35.
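The evaluation problem is usually solved with the forward algorithm, which sums over all state paths in time linear in the sequence length. The sketch below uses the casino's emission probabilities and a 0.95/0.05 stay/switch split; the equal start probabilities and the dictionary layout are assumptions for the example.

```python
STATES = ("F", "L")
START = {"F": 0.5, "L": 0.5}  # assumed: either die is equally likely at the start
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.05, "L": 0.95}}
EMIT = {
    "F": {r: 1 / 6 for r in range(1, 7)},
    "L": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},
}


def forward_probability(rolls):
    """P(rolls) under the model, summed over every fair/loaded state path."""
    # f[k] = P(rolls seen so far, current state = k)
    f = {k: START[k] * EMIT[k][rolls[0]] for k in STATES}
    for x in rolls[1:]:
        f = {k: EMIT[k][x] * sum(f[j] * TRANS[j][k] for j in STATES) for k in STATES}
    return sum(f.values())
```

For long sequences the raw probabilities underflow (values like 10^-35 appear after only a few dozen rolls), so practical implementations rescale at each step or work in log space.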

Question # 2 – Decoding
GIVEN: a sequence of rolls by the casino player. QUESTION: what portion of the sequence was generated with the fair die, and what portion with the loaded die? This is the DECODING question in HMMs. This is what we want to solve for CGH analysis.
[Figure: a roll sequence labeled FAIR, LOADED, FAIR.]
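Decoding is commonly done with the Viterbi algorithm, which finds the single most likely state path by dynamic programming. The sketch below works in log space on the casino model; the equal start probabilities and the `"F"`/`"L"` string output are assumptions for the example.

```python
import math

STATES = ("F", "L")
START = {"F": 0.5, "L": 0.5}  # assumed start probabilities
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.05, "L": 0.95}}
EMIT = {
    "F": {r: 1 / 6 for r in range(1, 7)},
    "L": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},
}


def viterbi(rolls):
    """Most likely state path for the rolls, returned as a string of 'F'/'L'."""
    # v[k] = log-probability of the best path that ends in state k
    v = {k: math.log(START[k]) + math.log(EMIT[k][rolls[0]]) for k in STATES}
    back = []
    for x in rolls[1:]:
        ptr, nv = {}, {}
        for k in STATES:
            prev = max(STATES, key=lambda j: v[j] + math.log(TRANS[j][k]))
            ptr[k] = prev
            nv[k] = v[prev] + math.log(TRANS[prev][k]) + math.log(EMIT[k][x])
        back.append(ptr)
        v = nv
    # Trace the back-pointers from the best final state.
    state = max(STATES, key=lambda k: v[k])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))
```

A run with few sixes decodes as all fair, while a run dominated by sixes decodes as all loaded, because each switch costs a factor of 0.05/0.95 that a single favorable roll cannot repay.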

Question # 3 – Learning
GIVEN: a sequence of rolls by the casino player. QUESTION: how “loaded” is the loaded die? How “fair” is the fair die? How often does the casino player change from fair to loaded, and back? This is the LEARNING question in HMMs. Example estimate from the slide: Prob(6) = 64%.

Definition of a hidden Markov model
Definition: a hidden Markov model (HMM) consists of:
Alphabet Σ = { b1, b2, …, bM }.
Set of states Q = { 1, …, K }.
Transition probabilities between any two states: a_ij = transition probability from state i to state j, with a_i1 + … + a_iK = 1 for all states i = 1…K.
Start probabilities a_0i, with a_01 + … + a_0K = 1.
Emission probabilities within each state: e_k(b) = P(x_i = b | π_i = k), with e_k(b1) + … + e_k(bM) = 1 for all states k = 1…K.

The dishonest casino model
[State diagram: FAIR and LOADED, each staying put with probability 0.95 and switching to the other state with probability 0.05.]
Fair: P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6.
Loaded: P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10, P(6|L) = 1/2.

A HMM is memory-less
At each time step t, the only thing that affects future states is the current state π_t:
P(π_{t+1} = k | “whatever happened so far”) = P(π_{t+1} = k | π_1, π_2, …, π_t, x_1, x_2, …, x_t) = P(π_{t+1} = k | π_t)

A parse of a sequence
Given a sequence x = x1……xN, a parse of x is a sequence of states π = π1, ……, πN.
[Figure: trellis of states 1, 2, …, K over observations x1, x2, x3, …, with one path marked.]

Likelihood of a Parse
Simply, multiply all the orange arrows! (transition probs and emission probs)
[Figure: trellis of states 1, 2, …, K over observations x1, x2, x3, …, with the arrows of one parse highlighted.]

Likelihood of a parse
Given a sequence x = x1……xN and a parse π = π1, ……, πN, how likely is the parse (given our HMM)?
P(x, π) = P(x1, …, xN, π1, ……, πN)
= P(xN, πN | πN-1) P(xN-1, πN-1 | πN-2) …… P(x2, π2 | π1) P(x1, π1)
= P(xN | πN) P(πN | πN-1) …… P(x2 | π2) P(π2 | π1) P(x1 | π1) P(π1)
= a_{0π1} a_{π1π2} …… a_{πN-1πN} e_{π1}(x1) …… e_{πN}(xN)
A compact way to write this product: number all parameters a_ij and e_i(b) as θ_1, …, θ_n (n params). Example: a_0Fair : θ_1; a_0Loaded : θ_2; … e_Loaded(6) : θ_18. Then count in x and π the number of times each parameter j = 1, …, n occurs: F(j, x, π) = # times parameter θ_j occurs in (x, π) (call F(.,.,.) the feature counts). Then
P(x, π) = ∏_{j=1…n} θ_j^F(j, x, π) = exp[ Σ_{j=1…n} log(θ_j) F(j, x, π) ]
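The feature-count form can be sketched directly: collect how often each parameter is used by (x, π), then exponentiate the weighted sum of log-parameters. The tuple keys used to name parameters, the equal start probabilities, and the helper names are assumptions for the example; the model values are those of the dishonest casino.

```python
import math
from collections import Counter


def make_theta():
    """All 18 casino parameters, keyed by a descriptive tuple instead of an index j."""
    theta = {("start", "F"): 0.5, ("start", "L"): 0.5}  # assumed start probabilities
    for i in "FL":
        for j in "FL":
            theta[("trans", i, j)] = 0.95 if i == j else 0.05
    for r in range(1, 7):
        theta[("emit", "F", r)] = 1 / 6
        theta[("emit", "L", r)] = 0.5 if r == 6 else 0.1
    return theta


def feature_counts(x, pi):
    """F(j, x, pi): how many times each parameter occurs in the pair (x, pi)."""
    counts = Counter()
    counts[("start", pi[0])] += 1
    for prev, cur in zip(pi, pi[1:]):
        counts[("trans", prev, cur)] += 1
    for state, symbol in zip(pi, x):
        counts[("emit", state, symbol)] += 1
    return counts


def parse_likelihood(x, pi, theta):
    """P(x, pi) = exp(sum_j log(theta_j) * F(j, x, pi))."""
    counts = feature_counts(x, pi)
    return math.exp(sum(math.log(theta[j]) * f for j, f in counts.items()))
```

For example, `parse_likelihood([1, 2, 1, 5, 6, 2, 1, 5, 2, 4], "F" * 10, make_theta())` uses one start parameter, nine Fair-to-Fair transitions and ten fair emissions, reproducing the product ½ × (1/6)^10 × (0.95)^9.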

Example: the dishonest casino
Let the sequence of rolls be: x = 1, 2, 1, 5, 6, 2, 1, 5, 2, 4. Then, what is the likelihood of π = Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair? (say initial probs a_0Fair = ½, a_0Loaded = ½)
½ × P(1 | Fair) P(Fair | Fair) P(2 | Fair) P(Fair | Fair) … P(4 | Fair) = ½ × (1/6)^10 × (0.95)^9 ≈ 5.2 × 10^-9

So, the likelihood that the die is fair throughout this run is only about 5.2 × 10^-9. OK, but what is the likelihood of π = Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded?
½ × P(1 | Loaded) P(Loaded | Loaded) … P(4 | Loaded) = ½ × (1/10)^9 × (1/2)^1 × (0.95)^9 ≈ 0.16 × 10^-9
Therefore, it is considerably more likely (by a factor of about 33) that all the rolls are done with the fair die than that they are all done with the loaded die.

Let the sequence of rolls be: x = 1, 6, 6, 5, 6, 2, 6, 6, 3, 6. Now, what is the likelihood of π = F, F, …, F? ½ × (1/6)^10 × (0.95)^9 ≈ 5.2 × 10^-9, same as before. What is the likelihood of π = L, L, …, L? ½ × (1/10)^4 × (1/2)^6 × (0.95)^9 ≈ 4.9 × 10^-7. So, it is about 100 times more likely that the die is loaded.
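These products are easy to check numerically. The helper below is a small sketch for the special case where the parse stays in one state throughout; its name and argument layout are assumptions for the example.

```python
FAIR = {r: 1 / 6 for r in range(1, 7)}
LOADED = {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}


def single_state_likelihood(rolls, emit, stay=0.95, start=0.5):
    """P(x, pi) when pi stays in one state: start * prod(emissions) * stay^(N-1)."""
    p = start
    for i, r in enumerate(rolls):
        if i > 0:
            p *= stay  # one self-transition between consecutive rolls
        p *= emit[r]
    return p


x1 = [1, 2, 1, 5, 6, 2, 1, 5, 2, 4]
x2 = [1, 6, 6, 5, 6, 2, 6, 6, 3, 6]
for name, rolls in (("x1", x1), ("x2", x2)):
    fair = single_state_likelihood(rolls, FAIR)
    loaded = single_state_likelihood(rolls, LOADED)
    print(f"{name}: fair={fair:.3e} loaded={loaded:.3e} loaded/fair={loaded / fair:.2f}")
```

The all-fair likelihood is the same for both sequences because a fair die assigns 1/6 to every face; only the all-loaded likelihood depends on how many sixes were rolled.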

HMM Model for CGH data
States S1, S2, S3, S4. Fridlyand et al. (2004)

A model for CGH data
K states correspond to copy numbers. Emissions are Gaussians, one (µ_k, σ_k) per state:
S1: homozygous deletion (copy = 0), with µ1, σ1
S2: heterozygous deletion (copy = 1), with µ2, σ2
S3: normal (copy = 2), with µ3, σ3
S4: duplication (copy > 2), with µ4, σ4
[Figure: copy number against genome coordinate.]
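A four-state model of this kind can be decoded with the same Viterbi recursion used for discrete HMMs, with Gaussian log-densities as emissions. Everything numeric below is a hypothetical placeholder (state means on the log2-ratio scale, a common σ of 0.2, a 0.9 self-transition probability, uniform start); in practice these parameters would be estimated from the data.

```python
import math

MEANS = [-2.0, -1.0, 0.0, 0.58]  # hypothetical log2-ratio means for states S1..S4
SIGMAS = [0.2, 0.2, 0.2, 0.2]    # hypothetical standard deviations
STAY = 0.9                       # hypothetical probability of keeping the same state


def log_gauss(y, mu, sigma):
    """Log-density of N(mu, sigma^2) at y."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (y - mu) ** 2 / (2 * sigma ** 2)


def decode_copy_states(y):
    """Viterbi path over states 0..3 (S1..S4) for a list of log2 ratios."""
    K = len(MEANS)
    log_stay = math.log(STAY)
    log_move = math.log((1 - STAY) / (K - 1))
    v = [math.log(1 / K) + log_gauss(y[0], MEANS[k], SIGMAS[k]) for k in range(K)]
    back = []
    for obs in y[1:]:
        ptr, nv = [], []
        for k in range(K):
            cand = [v[j] + (log_stay if j == k else log_move) for j in range(K)]
            j_best = max(range(K), key=cand.__getitem__)
            ptr.append(j_best)
            nv.append(cand[j_best] + log_gauss(obs, MEANS[k], SIGMAS[k]))
        back.append(ptr)
        v = nv
    # Trace the back-pointers from the best final state.
    k = max(range(K), key=v.__getitem__)
    path = [k]
    for ptr in reversed(back):
        k = ptr[k]
        path.append(k)
    return path[::-1]
```

On a profile such as `[0.0, 0.1, -0.05, 0.6, 0.5, 0.65, -1.0, -0.9, -1.1]` the decoded path is three normal probes, three duplicated probes and three heterozygously deleted probes, which is the segmentation into copy number states that the model is meant to deliver.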

Sources
Raphael, Volik, Collins and Pevzner. Reconstructing tumor genome architectures. Bioinformatics, 2003.
Raphael and Pevzner. Reconstructing tumor amplisomes. Bioinformatics, 2004.
Olshen et al. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics, 2004.
Lipson et al. Efficient Calculation of Interval Scores for DNA Copy Number Data Analysis. Journal of Computational Biology, 2006.
Fridlyand et al. Hidden Markov models approach to the analysis of array CGH data. Journal of Multivariate Analysis, 2004. (HMM slides)
