CPM 2006: SNP and Haplotype Analysis: Algorithms and Applications. Eran Halperin, International Computer Science Institute, Berkeley, California.


1 SNP and Haplotype Analysis: Algorithms and Applications. Eran Halperin, International Computer Science Institute, Berkeley, California. CPM 2006.

2 “Computational Genetics”

3 The Human Genome Project. “What we are announcing today is that we have reached a milestone…that is, covering the genome in…a working draft of the human sequence.” “But our work previously has shown… that having one genetic code is important, but it's not all that useful.” (referring to comparative genomics). “I would be willing to make a prediction that within 10 years, we will have the potential of offering any of you the opportunity to find out what particular genetic conditions you may be at increased risk for…” Washington, DC, June 26, 2000.

4 Individually Tailored Medicine. People react to different drugs in different ways. The vision: a simple DNA test would help to determine which medicine to prescribe.

5 An international consortium that aims to genotype the genomes of 270 individuals from four different populations. Launched in 2002. The first phase was finished in October (Nature, 2005).

6 Motivation. Complex disease: environmental factors (50%), genetic factors (50%). Multiple genes may affect the disease. Therefore, the effect of every single gene may be negligible.

7 Disease Association Studies. The search for genetic factors. Comparing the DNA contents of two populations: Cases - individuals carrying the disease. Controls - background population. A significant discrepancy between the two populations is evidence of a causal gene.
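One common way to quantify such a discrepancy at a single bi-allelic SNP is a Pearson chi-square test on the 2x2 table of allele counts. The sketch below is only an illustration of that idea: the talk does not say which test is used, and the counts are hypothetical, not data from the talk.

```python
def allele_chi_square(case_counts, control_counts):
    """Pearson chi-square statistic for a 2x2 table of allele counts.

    Each argument is (count of first allele, count of second allele).
    Assumes every row and column total is non-zero.
    """
    table = [list(case_counts), list(control_counts)]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    return stat


# Hypothetical allele counts: (allele A, allele G).
cases = (60, 40)
controls = (40, 60)
statistic = allele_chi_square(cases, controls)
print(statistic)
```

With one degree of freedom, a statistic above roughly 3.84 corresponds to p < 0.05 for a single test; a genome-wide scan tests many SNPs and would need a much stricter threshold.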

8 [Figure: aligned DNA sequences for cases and controls, with the associated SNP marked.] Where should we look? SNP = Single Nucleotide Polymorphism. Usually SNPs are bi-allelic (only two letters appear).

9 Where should we look?
person 1: ….AAGCTAAATTTG….
person 2: ….AAGCTAAGTTTG….
person 3: ….AAGCTAAGTTTG….
person 4: ….AAGCTAAATTTG….
person 5: ….AAGCTAAGTTTG….
SNP = Single Nucleotide Polymorphism. Usually SNPs are bi-allelic (only two letters appear).
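As a small illustration, SNP positions can be read off aligned sequences by looking for columns in which more than one letter appears. The function name `find_snps` is hypothetical; the input is the five-person fragment from the slide above.

```python
def find_snps(sequences):
    """Return {position: sorted alleles} for every polymorphic column.

    Positions are 0-based offsets into the aligned sequences.
    """
    snps = {}
    for pos, column in enumerate(zip(*sequences)):
        alleles = sorted(set(column))
        if len(alleles) > 1:
            snps[pos] = alleles
    return snps


people = [
    "AAGCTAAATTTG",  # person 1
    "AAGCTAAGTTTG",  # person 2
    "AAGCTAAGTTTG",  # person 3
    "AAGCTAAATTTG",  # person 4
    "AAGCTAAGTTTG",  # person 5
]
print(find_snps(people))  # {7: ['A', 'G']}
```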

10 Genotyping Technology. Extracting the allele information for a SNP from a DNA sample. Genotyping costs have dropped considerably in the last couple of years. The current cost allows for the genotyping of 500,000 SNPs for ~$1000 (compared to ~50 cents per SNP 3-4 years ago).

11 Computational Challenges

12 Haplotypes. SNPs in physical proximity are correlated. A sequence of alleles along a chromosome is called a haplotype.
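A standard way to measure the correlation between two SNPs is the r² statistic computed from phased haplotypes. The talk does not name a specific measure, so the sketch below is an assumption; the haplotypes are hypothetical 0/1 strings.

```python
def r_squared(haplotypes, i, j):
    """Squared correlation (r^2) between SNPs i and j over 0/1 haplotypes."""
    n = len(haplotypes)
    p_i = sum(h[i] == "1" for h in haplotypes) / n
    p_j = sum(h[j] == "1" for h in haplotypes) / n
    p_ij = sum(h[i] == "1" and h[j] == "1" for h in haplotypes) / n
    d = p_ij - p_i * p_j  # linkage disequilibrium coefficient D
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")
    return d * d / denom


# Hypothetical sample of four haplotypes over three SNPs.
sample = ["000", "000", "111", "110"]
print(r_squared(sample, 0, 1))  # SNPs 0 and 1 always agree
print(r_squared(sample, 0, 2))  # SNP 2 is only partly predicted by SNP 0
```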

13 Haplotype Block Structure (Daly et al., 2001). Block 6 from Chromosome 5q31.

14 Haplotypes as Proxies for Rare SNPs
Common haplotypes:
–011000111 (23% of population)
–000001111 (55% of population)
–111111111 (14% of population)
Tag SNPs: 000000 001001 111111

15 Tag SNP Selection. Input: a set of genotypes. Goal: find a set of t tag SNPs such that, using these SNPs only, the error rate for the prediction of all other SNPs is minimized. Formulation by [H., Kimmel, Shamir, 05’] (STAMPA).

16 Correlations between SNPs. [Figure: aligned DNA sequences for cases and controls, with the tag SNPs marked.]

17 Basic Assumption. Given two SNPs, the probabilities of the values at any intermediate SNPs do not change if we know the values of additional distal ones. [Figure: SNP j, intermediate SNPs, SNP k.]

18 STAMPA (Selection of TAg SNPs to Maximize Prediction Accuracy)
1. Put aside one test genotype. Use the rest of the data to develop a majority rule for each pair of SNPs to predict the values of the intermediate SNPs.
2. The prediction error, averaged over all test genotypes, gives a score to the pair j and k.
3. Apply dynamic programming to obtain the best set of tag SNPs.
[Figure: test genotype, with SNP j, intermediate SNPs, SNP k.]
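Steps 1 and 2 can be sketched as a leave-one-out score for a single pair of SNPs. This is a simplified illustration and not the STAMPA implementation: it works on phased 0/1 haplotypes rather than genotypes, the fallback for unseen allele combinations is an assumption, and the dynamic programming of step 3 is left out. The names `majority` and `pair_score` are hypothetical.

```python
from collections import Counter


def majority(values):
    """Most common value; ties go to the value seen first."""
    return Counter(values).most_common(1)[0][0]


def pair_score(haplotypes, j, k):
    """Leave-one-out error rate when the SNPs strictly between j and k
    are predicted from the alleles at j and k by a majority rule."""
    errors = 0
    predictions = 0
    for t, test in enumerate(haplotypes):
        training = haplotypes[:t] + haplotypes[t + 1:]
        # Training haplotypes that agree with the test haplotype at j and k.
        matching = [h for h in training if h[j] == test[j] and h[k] == test[k]]
        # Assumption: fall back to all training data if nothing matches.
        pool = matching or training
        for s in range(j + 1, k):
            guess = majority(h[s] for h in pool)
            errors += guess != test[s]
            predictions += 1
    return errors / predictions if predictions else 0.0


# Hypothetical data: the end SNPs fully determine the middle ones.
well_tagged = ["0000", "0000", "1111", "1111", "0000", "1111"]
# Hypothetical data: the end SNPs carry no information about the middle.
poorly_tagged = ["0000", "0110", "0000", "0110"]
print(pair_score(well_tagged, 0, 3), pair_score(poorly_tagged, 0, 3))
```

A lower score means the pair (j, k) predicts the SNPs between them more reliably, which is what the dynamic program would then exploit when choosing the tag set.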

19 Comparison: STAMPA vs. ldSelect. 52 sets of Yoruba genotypes (Gabriel et al., 2002). [Figure: comparison plot, with STAMPA plotted as x.]

20 The haplotype ancestral structure of two subtypes of NHL. The trees are automatically generated by HAP (H., Eskin, 04’).

21 Phasing. Cost-effective genotyping technology gives genotypes and not haplotypes.
Haplotypes: ATCCGA (mother chromosome), AGACGC (father chromosome).
Genotype: A {T,G} {A,C} C G {A,C}.
Possible phases: ATACGA / AGCCGC, AGACGA / ATCCGC, ….
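To make the ambiguity concrete, the sketch below enumerates every haplotype pair consistent with one genotype. It assumes the {0,1,2} encoding that appears in the example later in the talk, read as 0 and 1 for homozygous sites and 2 for heterozygous sites; the function name `possible_phases` is hypothetical. A genotype with h heterozygous sites has 2^(h-1) possible phases.

```python
from itertools import product


def possible_phases(genotype):
    """List all unordered haplotype pairs that combine into `genotype`."""
    het = [i for i, g in enumerate(genotype) if g == "2"]
    if not het:
        return [(genotype, genotype)]
    phases = []
    # Fix the first heterozygous site to 0 on the first haplotype so that
    # each unordered pair is listed exactly once.
    for bits in product("01", repeat=len(het) - 1):
        first = list(genotype)
        second = list(genotype)
        for site, bit in zip(het, ("0",) + bits):
            first[site] = bit
            second[site] = "1" if bit == "0" else "0"
        phases.append(("".join(first), "".join(second)))
    return phases


for pair in possible_phases("02022"):
    print(pair)
```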

22 Public Genotype Data Growth
2001: Daly et al., Nature Genetics. 103 SNPs, 40,000 genotypes.
2002: Gabriel et al., Science. 3000 SNPs, 400,000 genotypes.
2003: TSC Data, Nucleic Acids Research. 35,000 SNPs, 4,500,000 genotypes.
2004: Perlegen Data, Science. 1,570,000 SNPs, 100,000,000 genotypes.
2005: NCBI dbSNP, Genome Research. 3,000,000 SNPs, 286,000,000 genotypes.
2006: HapMap Phase 2. 5,000,000+ SNPs, 600,000,000+ genotypes.
- HAP’s speed allows it to phase whole-genome datasets.
- HAP is very accurate (Marchini et al., 2006).

23 HAP Phasing Model. A directed phylogenetic tree. {0,1} alphabet. Each site mutates at most once. No recombination. Goal: finding a phase that fits the tree model. Formulation: [Gusfield, 2003]. [Figure: tree over the haplotypes 00000, 01000, 11000, 01001, 11100, 11110, with edges labeled by the mutating sites 4, 3, 1, 5, 2.]
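Whether a set of 0/1 haplotypes fits this tree model can be checked with the classical perfect phylogeny condition: taking the all-zero haplotype as the root, the sets of haplotypes carrying a 1 at each site must be pairwise nested or disjoint. The sketch below only tests a given set of haplotypes; it does not phase genotypes and is not HAP itself. The all-zero root is an assumption, and the function name is hypothetical.

```python
from itertools import combinations


def fits_directed_perfect_phylogeny(haplotypes):
    """True if the 0/1 haplotypes fit a tree rooted at the all-zero
    haplotype in which every site mutates at most once."""
    n_sites = len(haplotypes[0])
    # For each site, the set of haplotypes (by row index) carrying a 1.
    carriers = [
        {row for row, h in enumerate(haplotypes) if h[site] == "1"}
        for site in range(n_sites)
    ]
    for a, b in combinations(carriers, 2):
        # Overlapping carrier sets must be nested; otherwise some site
        # would have to mutate twice or recombination would be needed.
        if a & b and not (a <= b or b <= a):
            return False
    return True


tree_haplotypes = ["00000", "01000", "11000", "01001", "11100", "11110"]
print(fits_directed_perfect_phylogeny(tree_haplotypes))
print(fits_directed_perfect_phylogeny(["10", "01", "11"]))
```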

24 Example. Genotypes: 02022, 22200, 21222, 21200, 02000, 01022. Haplotypes: 00000, 01000, 11100, 01011. [Figure: tree over 00000, 01000, 11000, 01001, 11100, 01011, with edges labeled 4, 3, 1, 5, 2.] Given the tree and the haplotypes, the phase is unique.
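The uniqueness claim can be checked for this example by brute force: for each genotype, list the pairs of the four haplotypes that combine into it. The sketch assumes that a genotype entry of 2 marks a heterozygous site, which is consistent with the example; the helper names are hypothetical.

```python
from itertools import combinations_with_replacement


def combine(h1, h2):
    """Genotype of two haplotypes: the shared allele, or 2 where they differ."""
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


def explanations(genotype, haplotypes):
    """All unordered pairs of the given haplotypes that yield the genotype."""
    return [
        (h1, h2)
        for h1, h2 in combinations_with_replacement(haplotypes, 2)
        if combine(h1, h2) == genotype
    ]


haplotypes = ["00000", "01000", "11100", "01011"]
genotypes = ["02022", "22200", "21222", "21200", "02000", "01022"]
for g in genotypes:
    print(g, explanations(g, haplotypes))
```

Each genotype in the example is explained by exactly one pair, so once the haplotype set is known, the phase is determined.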

25 Phasing via Greedy
A simple heuristic:
–Find a haplotype that is compatible with as many genotypes as possible.
–Assign the haplotype to these genotypes.
–Continue with the rest of the genotypes.
Intuition: Haplotypes with missing data.

26 Haplotypes with missing data
Input: 111*11*1, 00*01*1*, 01*000*0, 11*11*11, *111**00, 1111*11*, 01*00010
Goal: Find a maximum likelihood phase.
Output: 11111111, 00001111, 01000010, 11111111, 11110000, 11111111, 01000010
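The greedy heuristic can be sketched for the missing-data setting as follows. The brute-force search over all 2^L complete haplotypes is an assumption made to keep the example short and is only feasible for short strings; ties are broken by enumeration order, so entries that share a haplotype with no other entry may be completed differently from the slide's output.

```python
from itertools import product


def compatible(haplotype, observed):
    """True if the complete haplotype agrees with every known position."""
    return all(o == "*" or o == h for h, o in zip(haplotype, observed))


def greedy_phase(observations):
    """Repeatedly pick the complete haplotype compatible with the most
    unassigned observations and assign it to all of them."""
    length = len(observations[0])
    candidates = ["".join(bits) for bits in product("01", repeat=length)]
    assignment = [None] * len(observations)
    remaining = set(range(len(observations)))
    while remaining:
        def coverage(h):
            return sum(compatible(h, observations[i]) for i in remaining)

        best = max(candidates, key=coverage)
        for i in list(remaining):
            if compatible(best, observations[i]):
                assignment[i] = best
                remaining.discard(i)
    return assignment


observations = ["111*11*1", "00*01*1*", "01*000*0", "11*11*11",
                "*111**00", "1111*11*", "01*00010"]
result = greedy_phase(observations)
for observed, resolved in zip(observations, result):
    print(observed, "->", resolved)
```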

27 Greedy Analysis (H., Karp, 2005). Maximum likelihood == minimum entropy solution. Entropy(Greedy) < Entropy(OPT) + 3. Can be viewed as a variant of set cover.
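The entropy of a solution can be computed from how often each haplotype is used. The sketch below assumes the entropy in question is the Shannon entropy (in bits) of the empirical haplotype distribution, and compares the output from the missing-data slide with a hypothetical solution in which no haplotype is shared.

```python
from collections import Counter
from math import log2


def phase_entropy(haplotypes):
    """Shannon entropy (bits) of the empirical haplotype distribution."""
    counts = Counter(haplotypes)
    total = len(haplotypes)
    return -sum(c / total * log2(c / total) for c in counts.values())


slide_output = ["11111111", "00001111", "01000010", "11111111",
                "11110000", "11111111", "01000010"]
# Hypothetical alternative: seven different haplotypes, nothing shared.
no_sharing = [format(i, "08b") for i in range(7)]
print(phase_entropy(slide_output), phase_entropy(no_sharing))
```

Solutions that reuse a few haplotypes many times have lower entropy, which is why the greedy step of covering as many genotypes as possible with one haplotype is a natural heuristic.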

28 Mother, Father, Child Trios
Advantages:
–Better phasing results (Marchini et al., 06’).
–Robustness to population stratification (Spielman et al., 93’).
Disadvantage:
–50% more expensive (and thus, reduces power).

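29 Inferring Haplotypes From Trios
Genotypes: Parent 1: 122112. Parent 2: 210022. Child: 120222.
Assumption: No recombination.
[Figure: partial haplotypes resolved step by step, with ? marking an unresolved site.]
Parent 1: 1??11?, then 10?11? / 11?11?, then 10011? / 11111?.
Parent 2: ?100??, then 1100?? / 0100??, then 11000? / 01001?.
Child: 1?0???, then 100??? / 110???, then 10011? / 11000?.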

30 Genotyping Trios via DNA pools [Beckman, Abel, Braun, H.]. [Figure: pools formed from father (F), mother (M), and child (C).]

31
Configuration                             1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Mother transmitted allele                 A  A  A  A  A  A  A  A  G  G  G  G  G  G  G  G
Mother untransmitted allele               A  A  A  A  G  G  G  G  A  A  A  A  G  G  G  G
Father transmitted allele                 A  A  G  G  A  A  G  G  A  A  G  G  A  A  G  G
Father untransmitted allele               A  G  A  G  A  G  A  G  A  G  A  G  A  G  A  G
Father and Child pool, allele frequency   0  1  2  3  0  1  2  3  1  2  3  4  1  2  3  4
Mother and Child pool, allele frequency   0  0  1  1  1  1  2  2  2  2  3  3  3  3  4  4
-Every configuration has a different pair of values, except for configurations 7 and 10 (het-het-het).
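The table can be regenerated by enumerating the 16 transmission configurations and counting G alleles in each two-person pool. The sketch assumes that "allele frequency" means the number of G alleles among the four chromosomes in a pool, which matches the values on the slide; the function name is hypothetical.

```python
from collections import defaultdict
from itertools import product


def pool_signatures():
    """Map each (father+child, mother+child) G-count pair to the
    configuration numbers that produce it."""
    signatures = defaultdict(list)
    # Order: mother transmitted, mother untransmitted,
    #        father transmitted, father untransmitted.
    configs = product("AG", repeat=4)
    for number, (mt, mu, ft, fu) in enumerate(configs, start=1):
        child = [mt, ft]
        father_child = [ft, fu] + child
        mother_child = [mt, mu] + child
        key = (father_child.count("G"), mother_child.count("G"))
        signatures[key].append(number)
    return signatures


signatures = pool_signatures()
ambiguous = {k: v for k, v in signatures.items() if len(v) > 1}
print(ambiguous)  # {(2, 2): [7, 10]}
```

The only collision is between configurations 7 and 10, where mother, father, and child are all heterozygous, so the two pools identify every other configuration.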

32 Genotyping Unrelated Individuals. Edge size corresponds to pool size (accuracy). Vertex degree corresponds to the amount of DNA used.

33 An algebraic view

34 For every $m$, what is the largest $n$ such that $m$ equations uniquely determine the $n$ variables over $\{0,1,2\}$? Formally: for every $m$, what is the largest $n$ for which there exists $A \in \{0,1\}^{m \times n}$ such that for all $x, x' \in \{0,1,2\}^n$, $Ax = Ax' \Rightarrow x = x'$?
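For small cases, the property can be tested directly by enumerating all 3^n genotype vectors and checking that no two have the same image under A. This is a brute-force illustration only, feasible for small n; the function name is hypothetical, and each row of the matrix is read as one pool.

```python
from itertools import product


def uniquely_determines(matrix):
    """True if x -> Ax is injective on {0,1,2}^n for the 0/1 matrix."""
    n = len(matrix[0])
    seen = set()
    for x in product((0, 1, 2), repeat=n):
        image = tuple(sum(a * v for a, v in zip(row, x)) for row in matrix)
        if image in seen:
            return False  # two different genotype vectors look identical
        seen.add(image)
    return True


print(uniquely_determines([[1, 0], [0, 1]]))  # one pool per individual
print(uniquely_determines([[1, 1]]))          # a single pool of two
```

A single pool of two individuals fails because the vectors (0, 1) and (1, 0) give the same pooled count.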

35 Lower Bound
A random matrix $A$.
–For every $x \in \{-2,-1,0,1,2\}^n$, $A_i x = 0$ with probability $O(k^{-0.5})$, where $k$ is the number of non-zero elements of $x$.
–Since the rows are independent, the probability that $Ax = 0$ is $O(k^{-m/2})$.
–Using the union bound, $n = \Omega(m \log m)$.

36 Upper Bound
Counting argument:
–Each entry of $Ax$ is an integer between $0$ and $2n$, so there are at most $(2n+1)^m$ different values that $Ax$ can take.
–There are $3^n$ values for $x$.
–Hence $3^n \le (2n+1)^m$, and so $n = O(m \log m)$.

37 Further Challenges
Population stratification
–In case/control studies and in family-based studies.
–Admixed populations.
Other pooling schemes
–Practical considerations: error rates, missing data, scalability, etc.
Inferring evolutionary processes (e.g. selection, recombination rate, haplotype ancestry, etc.).

38 Summary. Exciting times in genetics: changes in medicine may be felt in our lifetime. –An opportunity for Computer Scientists to have a huge impact. Interdisciplinary work is needed. It involves computer science, statistics, genetics, biology, and medicine.

39 Acknowledgement
UCSD –Eleazar Eskin
Tel-Aviv U. –Ron Shamir –Gad Kimmel –Noga Alon
HIIT –Matti Kaariainen
Sequenom Inc. –Andreas Braun –Ken Abel
Perlegen Sciences –David Hinds –David Cox
UC Berkeley –Richard Karp –Chris Skibola
MPI –Rene Beier
CHORI –Kenny Beckman


