CPM 2006: SNP and Haplotype Analysis: Algorithms and Applications. Eran Halperin, International Computer Science Institute, Berkeley, California.


1 CPM SNP and Haplotype Analysis Algorithms and Applications Eran Halperin International Computer Science Institute Berkeley, California

2 CPM “Computational Genetics”

3 CPM The Human Genome Project “What we are announcing today is that we have reached a milestone…that is, covering the genome in…a working draft of the human sequence.” “But our work previously has shown… that having one genetic code is important, but it's not all that useful.” (referring to comparative genomics). “I would be willing to make a prediction that within 10 years, we will have the potential of offering any of you the opportunity to find out what particular genetic conditions you may be at increased risk for…” Washington, DC, June 26, 2000.

4 CPM Individually Tailored Medicine People react to different drugs in different ways. The vision: a simple DNA test would help to determine which medicine to prescribe.

5 CPM International consortium that aims to genotype the genomes of 270 individuals from four different populations. Launched in [year missing in transcript]. The first phase was finished in October (Nature, 2005).

6 CPM Motivation Environmental Factors (50%) Genetic Factors (50%) Complex disease Multiple genes may affect the disease. Therefore, the effect of every single gene may be negligible.

7 CPM Disease Association Studies The search for genetic factors. Comparing the DNA contents of two populations: Cases - individuals carrying the disease. Controls - background population. A significant discrepancy between the two populations is evidence of a causal gene.
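One common way to quantify such a discrepancy at a single bi-allelic SNP is a chi-square test on the 2x2 table of allele counts in cases and controls. The sketch below is an illustration only: the function name and the example counts are hypothetical and do not come from the talk.

```python
def chi_square_2x2(case_a, case_b, control_a, control_b):
    """Pearson chi-square statistic (1 degree of freedom, no continuity
    correction) for a 2x2 table of allele counts.

    case_a / case_b: counts of the two alleles among cases.
    control_a / control_b: counts of the two alleles among controls.
    """
    n = case_a + case_b + control_a + control_b
    row_cases = case_a + case_b
    row_controls = control_a + control_b
    col_a = case_a + control_a
    col_b = case_b + control_b
    denom = row_cases * row_controls * col_a * col_b
    if denom == 0:
        return 0.0  # a degenerate table carries no evidence either way
    return n * (case_a * control_b - case_b * control_a) ** 2 / denom


# Hypothetical counts: allele frequencies identical in both groups.
print(chi_square_2x2(30, 70, 30, 70))  # 0.0
```

For one degree of freedom, a statistic above roughly 3.84 corresponds to p < 0.05, before any correction for testing many SNPs.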

8 CPM Where should we look? [Figure: aligned DNA sequences, split into cases and controls, with the associated SNP marked.] SNP = Single Nucleotide Polymorphism. Usually SNPs are bi-allelic (only two letters appear).

9 CPM Where should we look? person 1: ….AAGCTAAATTTG…. person 2: ….AAGCTAAGTTTG…. person 3: ….AAGCTAAGTTTG…. person 4: ….AAGCTAAATTTG…. person 5: ….AAGCTAAGTTTG…. SNP = Single Nucleotide Polymorphism Usually SNPs are bi-allelic (only two letters appear).

10 CPM Genotyping Technology Extracting the allele information for a SNP from a DNA sample. Genotyping costs have dropped considerably in the last couple of years. Current cost allows for the genotyping of 500,000 SNPs for ~$1000 (compared to ~50 cents per SNP 3-4 years ago).

11 CPM Computational Challenges

12 CPM Haplotypes SNPs in physical proximity are correlated. A sequence of alleles along a chromosome is called a haplotype.

13 CPM Haplotype Block Structure (Daly et al., 2001) Block 6 from Chromosome 5q31

14 CPM Haplotypes as Proxies for Rare SNPs Common haplotypes: – (23% of population) – (55% of population) – (14% of population) Tag SNPs

15 CPM Tag SNP Selection Input: a set of genotypes. Goal: find a set of t tag SNPs such that, using only these SNPs, the error rate for the prediction of all other SNPs is minimized. Formulation by [H., Kimmel, Shamir, '05] (STAMPA)

16 CPM Correlations between SNPs [Figure: aligned DNA sequences for cases and controls, with the tag SNPs marked.]

17 CPM Basic Assumption Given two SNPs, the probabilities of the values at any intermediate SNPs do not change if we know the values of additional distal ones. SNP j SNP k intermediate SNPs
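One way to write this assumption down, using notation that is not in the slides ($s_i$ for the value at SNP $i$, and $D$ for any set of SNPs lying outside the interval between $j$ and $k$):

```latex
\Pr\left[ s_i \mid s_j, s_k, s_D \right] = \Pr\left[ s_i \mid s_j, s_k \right]
\qquad \text{for all } j < i < k .
```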

18 CPM STAMPA (Selection of TAg SNPs to Maximize Prediction Accuracy) 1. Put aside one test genotype. Use the rest of the data to develop a majority rule for each pair of SNPs to predict the values of the intermediate SNPs. 2. The prediction error averaged over all test genotypes gives a score to the pair j and k. 3. Apply dynamic programming to obtain the best set of tag SNPs. Test genotype SNP j SNP k intermediate SNPs
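Steps 1 and 2 can be sketched as a leave-one-out loop. This is a simplified illustration, not the published STAMPA implementation: genotypes are assumed to be tuples over {0, 1, 2}, the names `majority_predict` and `pair_score` are hypothetical, and falling back to the overall majority when no training genotype matches is an assumption made here. At least two genotypes are needed.

```python
from collections import Counter


def majority_predict(train, j, k, i, vj, vk):
    """Predict SNP i as the most common value among training genotypes
    whose values at SNPs j and k equal (vj, vk)."""
    matches = [g[i] for g in train if g[j] == vj and g[k] == vk]
    pool = matches or [g[i] for g in train]  # assumed fallback: overall majority
    return Counter(pool).most_common(1)[0][0]


def pair_score(genotypes, j, k):
    """Leave-one-out error rate for predicting the SNPs strictly between j and k."""
    errors = total = 0
    for t, test in enumerate(genotypes):
        train = genotypes[:t] + genotypes[t + 1:]  # put the test genotype aside
        for i in range(j + 1, k):
            pred = majority_predict(train, j, k, i, test[j], test[k])
            errors += pred != test[i]
            total += 1
    return errors / total if total else 0.0


# Hypothetical data in which the middle SNP always equals its neighbours.
data = [(0, 0, 0), (0, 0, 0), (2, 2, 2), (2, 2, 2), (1, 1, 1), (1, 1, 1)]
print(pair_score(data, 0, 2))  # 0.0
```

A dynamic program for step 3 would then combine these pair scores over consecutive tag SNPs; it is omitted here because the slides do not specify how SNPs outside the first and last tag are handled.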

19 CPM Comparison: STAMPA vs. ldSelect x - STAMPA, - ldSelect 52 sets of Yoruba genotypes (Gabriel et al., 2002).

20 CPM The haplotype ancestral structure of two subtypes of NHL. The trees are automatically generated by HAP (H., Eskin, '04).

21 CPM Phasing Cost-effective genotyping technology gives genotypes and not haplotypes. Haplotypes (mother chromosome, father chromosome): ATCCGA, AGACGC. Genotype: A {G,T} {A,C} C G {A,C}. Possible phases: ATCCGA / AGACGC, ATACGA / AGCCGC, AGACGA / ATCCGC, ….
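The ambiguity can be made concrete by enumerating every pair of haplotypes consistent with a genotype. In the sketch below the genotype is derived from the two haplotypes shown on the slide (ATCCGA and AGACGC); the function name and the set-per-site representation are illustrative choices, not part of the talk.

```python
from itertools import product


def possible_phases(genotype):
    """genotype: list of allele sets per site, e.g. {'A'} or {'G', 'T'}.
    Returns the set of unordered haplotype pairs consistent with it."""
    het_sites = [i for i, site in enumerate(genotype) if len(site) == 2]
    phases = set()
    for choice in product((0, 1), repeat=len(het_sites)):
        pick = dict(zip(het_sites, choice))
        h1, h2 = [], []
        for i, site in enumerate(genotype):
            alleles = sorted(site)
            if len(alleles) == 1:          # homozygous: same allele on both
                h1.append(alleles[0])
                h2.append(alleles[0])
            else:                          # heterozygous: one allele each
                h1.append(alleles[pick[i]])
                h2.append(alleles[1 - pick[i]])
        phases.add(frozenset(("".join(h1), "".join(h2))))
    return phases


genotype = [{a, b} for a, b in zip("ATCCGA", "AGACGC")]
for pair in sorted(sorted(p) for p in possible_phases(genotype)):
    print(" / ".join(pair))
```

With h heterozygous sites there are 2^(h-1) distinct phases, so the three heterozygous sites here give four candidates.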

22 CPM Public Genotype Data Growth 2001 Daly et al. Nature Genetics 103 SNPs 40,000 genotypes Gabriel et al. Science 3000 SNPs 400,000 genotypes 2002 TSC Data Nucleic Acids Research 35,000 SNPs 4,500,000 genotypes 2003 Perlegen Data Science 1,570,000 SNPs 100,000,000 genotypes 2004 NCBI dbSNP Genome Research 3,000,000 SNPs 286,000,000 genotypes 2005 HapMap Phase 2 5,000,000+ SNPs 600,000,000+ genotypes HAP’s speed allows it to phase whole-genome datasets - HAP is very accurate (Marchini et al., 2006).

23 CPM HAP Phasing Model A directed phylogenetic tree. {0,1} alphabet. Each site mutates at most once. No recombination. Goal: Finding a phase that fits the tree model Formulation: [Gusfield, 2003]
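Whether a set of {0,1} haplotypes fits such a tree can be checked with the classical pairwise test for a directed perfect phylogeny. The sketch below assumes that 0 is the ancestral state at every site (the root is the all-zero haplotype); that assumption, and the function name, are not stated in the slides.

```python
from itertools import combinations


def fits_directed_perfect_phylogeny(haplotypes):
    """haplotypes: sequences over {0, 1}, all of the same length.
    With an all-zero root, one mutation per site and no recombination,
    no pair of sites may show all three gametes (0,1), (1,0) and (1,1)."""
    n_sites = len(haplotypes[0])
    forbidden = {(0, 1), (1, 0), (1, 1)}
    for a, b in combinations(range(n_sites), 2):
        gametes = {(h[a], h[b]) for h in haplotypes}
        if forbidden <= gametes:
            return False
    return True


print(fits_directed_perfect_phylogeny([(0, 0), (0, 1), (1, 1)]))  # True
print(fits_directed_perfect_phylogeny([(0, 1), (1, 0), (1, 1)]))  # False
```

A phasing algorithm under this model has to choose haplotypes for every genotype so that the combined set passes this test.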

24 CPM Example Genotypes Haplotypes Given the tree and the haplotypes the phase is unique

25 CPM Phasing via Greedy A simple heuristic: –Find a haplotype that is compatible with as many genotypes as possible. –Assign the haplotype for these genotypes. –Continue with the rest of the genotypes. Intuition: Haplotypes with missing data.

26 CPM Haplotypes with missing data Input: 111*11*1 00*01*1* 01*000*0 11*11*11 *111** *11* 01*00010 Goal: Find a maximum likelihood phase. Output:
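The greedy heuristic, applied to haplotypes with missing data, can be sketched as follows. This is a brute-force illustration that tries all 2^n complete haplotypes and is only practical for very short inputs; the function names and the example rows are hypothetical, and all rows are assumed to have the same length.

```python
from itertools import product


def compatible(hap, row):
    """A complete haplotype matches a row if it agrees on every non-'*' site."""
    return all(r == "*" or r == h for h, r in zip(hap, row))


def greedy_complete(rows):
    """Repeatedly pick the complete haplotype compatible with the most
    unresolved rows and assign it to all of them."""
    n = len(rows[0])
    remaining = set(range(len(rows)))
    assignment = {}
    while remaining:
        best_hap, best_cover = None, []
        for cand in product("01", repeat=n):
            cover = [i for i in remaining if compatible(cand, rows[i])]
            if len(cover) > len(best_cover):
                best_hap, best_cover = "".join(cand), cover
        for i in best_cover:
            assignment[i] = best_hap
        remaining -= set(best_cover)
    return [assignment[i] for i in range(len(rows))]


rows = ["1*0", "*10", "0*1", "00*"]
print(greedy_complete(rows))  # ['110', '110', '001', '001']
```

Each round covers as many rows as possible with one haplotype, which is why the problem resembles set cover.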

27 CPM Greedy Analysis (H., Karp, 2005) Maximum likelihood == minimum entropy solution. Entropy(Greedy) < Entropy(OPT) + 3. Can be viewed as a variant of set cover.

28 CPM Mother, Father, Child Trios Advantages: –Better phasing results (Marchini et al., '06). –Robustness to population stratification (Spielman et al., '93). Disadvantage: –50% more expensive (and thus reduces power).

29 CPM Inferring Haplotypes From Trios Assumption: No recombination. Parent 1, Parent 2, Child (partially resolved haplotypes from the figure, in extraction order): ??11? ?100?? 1?0??? 10?11? 11?11? 1100?? 0100?? 100??? 110??? 1??11? 1100?? 0100?? 1?0??? 10011? 11111? 11000? 01001? 10011? 11000?
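Under the no-recombination assumption, each site of a trio can be resolved independently from the three genotypes, except when mother, father and child are all heterozygous. A minimal sketch, assuming genotypes are coded as the number of copies of allele 1 (0, 1 or 2); the function names are hypothetical:

```python
def can_transmit(genotype, allele):
    """True if a parent with this genotype can pass on `allele` (0 or 1)."""
    return genotype == 1 or genotype == 2 * allele


def transmitted(mother, father, child):
    """Resolve one site. Returns (allele from mother, allele from father),
    or None when the site is ambiguous (all three heterozygous)."""
    options = [(m, f)
               for m in (0, 1) if can_transmit(mother, m)
               for f in (0, 1) if can_transmit(father, f)
               if m + f == child]
    if not options:
        raise ValueError("genotypes violate Mendelian inheritance")
    return options[0] if len(options) == 1 else None


def phase_child(mother, father, child):
    """Return the child's maternal and paternal haplotypes, '?' where ambiguous."""
    maternal, paternal = [], []
    for m, f, c in zip(mother, father, child):
        result = transmitted(m, f, c)
        maternal.append("?" if result is None else str(result[0]))
        paternal.append("?" if result is None else str(result[1]))
    return "".join(maternal), "".join(paternal)


print(phase_child([1, 0, 1], [2, 1, 1], [1, 1, 1]))  # ('00?', '11?')
```

The remaining '?' sites are the ones that still need a population-based phasing method.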

30 CPM Genotyping Trios via DNA pools [Beckman, Abel, Braun, H.] (figure labels: F, M, C)

31 CPM Mother transmitted allele: AAAAAAAAGGGGGGGG. Mother untransmitted allele: AAAAGGGGAAAAGGGG. Father transmitted allele: AAGGAAGGAAGGAAGG. Father untransmitted allele: AGAGAGAGAGAGAGAG. Father and Child pool: allele frequency. Mother and Child pool: allele frequency. Every configuration has a different pair of values, except for configurations 7 and 10 (het-het-het).
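The claim about configurations 7 and 10 can be checked directly from the four allele rows. The sketch below assumes that each pool mixes equal amounts of DNA from its two members, so a pool holds four chromosomes and the frequency of allele G is (number of G alleles) / 4; that equal-mixing assumption and the names used are not taken from the slides.

```python
from collections import defaultdict

MOTHER_T = "AAAAAAAAGGGGGGGG"  # mother transmitted allele
MOTHER_U = "AAAAGGGGAAAAGGGG"  # mother untransmitted allele
FATHER_T = "AAGGAAGGAAGGAAGG"  # father transmitted allele
FATHER_U = "AGAGAGAGAGAGAGAG"  # father untransmitted allele


def pool_frequencies(config):
    """G-allele frequency in the father+child and mother+child pools
    for a 0-based configuration index."""
    mt, mu, ft, fu = (int(row[config] == "G")
                      for row in (MOTHER_T, MOTHER_U, FATHER_T, FATHER_U))
    child = mt + ft                      # child carries both transmitted alleles
    father_child = (ft + fu + child) / 4
    mother_child = (mt + mu + child) / 4
    return father_child, mother_child


def colliding_configurations():
    """Groups of 1-based configuration numbers sharing the same pair of values."""
    groups = defaultdict(list)
    for config in range(16):
        groups[pool_frequencies(config)].append(config + 1)
    return [g for g in groups.values() if len(g) > 1]


print(colliding_configurations())  # [[7, 10]]
```

Configurations 7 and 10 both give a G frequency of 0.5 in each pool, which is exactly the case where all three trio members are heterozygous.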

32 CPM Genotyping Unrelated Individuals Edge size ↔ pool size (accuracy). Vertex degree ↔ amount of DNA used.

33 CPM An algebraic view

34 CPM For every m, what is the largest n, so that m equations uniquely determine the n {0,1,2} variables? For every $m$, what is the largest $n$ for which $\exists A \in \{0,1\}^{m \times n}$ s.t. $\forall x, x' \in \{0,1,2\}^n$, $Ax = Ax' \Rightarrow x = x'$?
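For small cases the condition can be tested by brute force: a 0/1 pooling matrix works exactly when no two genotype vectors in {0,1,2}^n produce the same measurements. The sketch below is exponential in n and the function name is hypothetical.

```python
from itertools import product


def uniquely_determines(A):
    """A: list of m rows, each a list of n entries in {0, 1}.
    Returns True if x -> Ax is injective on {0, 1, 2}^n."""
    n = len(A[0])
    seen = set()
    for x in product((0, 1, 2), repeat=n):
        y = tuple(sum(a * xi for a, xi in zip(row, x)) for row in A)
        if y in seen:      # two different genotype vectors, same pool readings
            return False
        seen.add(y)
    return True


print(uniquely_determines([[1, 0], [0, 1]]))  # True: each individual measured alone
print(uniquely_determines([[1, 1]]))          # False: (0,2), (1,1), (2,0) all sum to 2
```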

35 CPM Lower Bound A random matrix $A$. –For every $x \in \{-2,-1,0,1,2\}^n$, $A_i x = 0$ with prob. $O(k^{-0.5})$, where $k$ is the number of non-zero elements. –Since the rows are independent, the probability that $Ax = 0$ is $O(k^{-m/2})$. –Using union bound, $n = \Omega(m \log m)$.

36 CPM Upper Bound Counting argument: –There are at most $(2n)^m$ different values that $Ax$ can take. –There are $3^n$ values for $x$. –$3^n < (2n)^m$ and so $n = O(m \log m)$.
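The last step of the counting argument can be spelled out by taking logarithms (a sketch; constants are not optimized):

```latex
3^n < (2n)^m
\;\Longrightarrow\; n \log 3 < m \log (2n)
\;\Longrightarrow\; \frac{n}{\log (2n)} < \frac{m}{\log 3}
\;\Longrightarrow\; n = O(m \log m).
```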

37 CPM Further Challenges Population stratification –In case/control studies and in family based studies. –Admixed populations. Other pooling schemes –Practical considerations: error rates, missing data, scalability, etc. Inferring evolutionary processes (e.g. selection, recombination rate, haplotype ancestry, etc.).

38 CPM Summary Exciting times in genetics: changes in medicine may be felt in our lifetime. –An opportunity for Computer Scientists to have a huge impact. Interdisciplinary work is needed. It involves computer science, statistics, genetics, biology, and medicine.

39 CPM Acknowledgement UCSD: Eleazar Eskin. Tel-Aviv U.: Ron Shamir, Gad Kimmel, Noga Alon. HIIT: Matti Kaariainen. Sequenom Inc.: Andreas Braun, Ken Abel. Perlegen Sciences: David Hinds, David Cox. UC Berkeley: Richard Karp, Chris Skibola. MPI: Rene Beier. CHORI: Kenny Beckman.


