Of Sea Urchins, Birds and Men


1 Of Sea Urchins, Birds and Men

2 Introduction SNPs, HAPLOTYPES

3 Course topics
Genome Assembly (Ch. 5); Hidden Markov Models (Ch. 4); Phylogenetic Trees (Ch. 3); Sequence Alignment (Ch. 1)

4 Single Nucleotide Polymorphism (SNP)
GATTTAGATCGCGATAGAG
GATTTAGATCTCGATAGAG
A SNP is a position in a genome at which two or more different bases occur in the population, each with a frequency >1%.
SNPs are the most abundant type of polymorphism.
The two alleles at the site shown are G and T.
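As a small illustration of the definition, the sketch below compares the two sequences from this slide and reports where they differ. The function name `snp_positions` is hypothetical. A mismatch between two sequences is only a candidate SNP: the >1% frequency criterion needs population data, which this sketch does not model.

```python
def snp_positions(seq_a: str, seq_b: str) -> list[tuple[int, str, str]]:
    """Return (0-based position, base in seq_a, base in seq_b) for each mismatch."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned and of equal length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


# The two sequences shown on the slide.
sites = snp_positions("GATTTAGATCGCGATAGAG", "GATTTAGATCTCGATAGAG")
print(sites)  # [(10, 'G', 'T')]
```

The single mismatch carries the alleles G and T, matching the slide.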

5 Human Genome
[Slide background: a long excerpt of raw human genomic sequence.]
The human genome contains ~3 G base pairs arranged in 46 chromosomes.
Two individuals are 99.9% the same, i.e., they differ in ~3 M base pairs.
SNPs occur once every ~600 bp.
The average gene in the human genome spans ~27 kb.
~50 SNPs per gene
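The figures on this slide can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch using the slide's rounded numbers; the variable names are made up for illustration.

```python
GENOME_BP = 3_000_000_000   # ~3 G base pairs
DIFF_PER_1000_BP = 1        # 99.9% identical means ~1 difference per 1000 bp
SNP_SPACING_BP = 600        # one SNP every ~600 bp
GENE_SPAN_BP = 27_000       # average gene spans ~27 kb

differing_bp = GENOME_BP * DIFF_PER_1000_BP // 1000
snps_per_gene = GENE_SPAN_BP / SNP_SPACING_BP

print(differing_bp)   # 3000000, i.e. ~3 M base pairs
print(snps_per_gene)  # 45.0, which the slide rounds to ~50
```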

6 Haplotypes
G C T C G A C A A C A G    haplotype: C A G
G T T C G T C A A C A G    haplotype: T T G
[Figure: sequences from two individuals with three SNP sites marked; the alleles at the SNP sites make up each haplotype.]

7 Mutations
Infinite Sites Assumption: Each site mutates at most once

8 Haplotype Pattern
[Figure: haplotypes C A G T T T G A C A T G C T G T shown with the binary pattern 0 0 0 0 1 1 0 1]
At each SNP site, label the two alleles as 0 and 1. The choice of which allele is 0 and which is 1 is arbitrary.
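A minimal sketch of this relabeling, assuming haplotypes are given as equal-length strings with at most two alleles per site. The function name `encode_haplotypes` and the third haplotype in the example are hypothetical. The allele carried by the first haplotype is labeled 0, which is one arbitrary choice among several.

```python
def encode_haplotypes(haplotypes: list[str]) -> list[str]:
    """Relabel the alleles at every site as 0/1 (first haplotype's allele is 0)."""
    n_sites = len(haplotypes[0])
    if any(len(h) != n_sites for h in haplotypes):
        raise ValueError("haplotypes must all have the same length")
    encoded = [""] * len(haplotypes)
    for site in range(n_sites):
        column = [h[site] for h in haplotypes]
        if len(set(column)) > 2:
            raise ValueError(f"site {site} has more than two alleles")
        zero_allele = column[0]  # arbitrary choice of the 0 allele
        for row, allele in enumerate(column):
            encoded[row] += "0" if allele == zero_allele else "1"
    return encoded


# "CAG" and "TTG" are haplotypes from the earlier slide; "CTG" is made up.
print(encode_haplotypes(["CAG", "TTG", "CTG"]))  # ['000', '110', '010']
```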

9 Recombination
G T T C G A C A A C A T
A C G T A T C T A T T A
G T T C G A C T A T T A

10 Recombination
G T T C G A C A A C A T
A C G T A T C T A T T A
The two alleles are linked, i.e., they are "traveling together".
Recombination disrupts the linkage:
G T T C G A C T A T T A
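The recombinant sequence on this slide can be reproduced with a single crossover between the two parental sequences. The sketch below assumes one crossover point and no other events; the function name `recombine` and the parameter `crossover_at` are hypothetical.

```python
def recombine(parent_a: str, parent_b: str, crossover_at: int) -> str:
    """Take parent_a up to the crossover point and parent_b from there on."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    if not 0 <= crossover_at <= len(parent_a):
        raise ValueError("crossover point out of range")
    return parent_a[:crossover_at] + parent_b[crossover_at:]


# Parental sequences from the slide, crossing over after the 7th base.
child = recombine("GTTCGACAACAT", "ACGTATCTATTA", 7)
print(child)  # GTTCGACTATTA
```

After the crossover, the alleles to the left and right of the breakpoint no longer come from the same parental chromosome, which is the sense in which recombination disrupts linkage.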

11 Linkage Disequilibrium (LD)
[Figure: starting from a common ancestor, variations emerge over time; the present-day panel shows variations in chromosomes within a population, including a disease mutation.]

12 Extent of Linkage Disequilibrium
[Figure: chromosome segments around a disease-causing mutation at three time points: 2,000 generations ago, 1,000 generations ago, and the present.]

13 A Data Compression Problem
Select SNPs to use in an association study
We would like to associate single nucleotide polymorphisms (SNPs) with disease.
Very large number of candidate SNPs
Chromosome-wide studies, whole-genome scans
For cost effectiveness, select only a subset.
Closely spaced SNPs are highly correlated
It is less likely that there has been a recombination between two SNPs if they are close to each other.
[Speaker notes: Goal of talk. We want to study genomic variation; the particular type of variation we are interested in is the single nucleotide polymorphism. This may be useful, for instance, when one is doing an association study (it will be implemented as a part of the Applera Assays on Demand and iScience initiatives). A large number of SNPs are known to exist, and when one does an association study one would like to minimize its cost by selecting only an informative subset of the SNPs. We are considering the case where there is a large number of SNPs to use for the study, such as when doing chromosome-wide or genome-wide scans. For practical reasons, such as cost-effectiveness, we would like to limit the number of SNPs that we use. The main guidance we have in this pursuit is the fact that closely spaced SNPs are highly correlated.]
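One common way to quantify how correlated two SNPs are is the r² measure of linkage disequilibrium. The slides do not say which statistic was used at this point, so the sketch below is an assumption for illustration. Each SNP is given as a string of 0/1 alleles, one character per haplotype; the function name `ld_r2` is hypothetical.

```python
def ld_r2(site_a: str, site_b: str) -> float:
    """r^2 between two biallelic SNPs, each a 0/1 string with one char per haplotype."""
    if len(site_a) != len(site_b):
        raise ValueError("both SNPs must be typed on the same haplotypes")
    n = len(site_a)
    p_a = site_a.count("1") / n                      # frequency of allele 1 at SNP a
    p_b = site_b.count("1") / n                      # frequency of allele 1 at SNP b
    p_ab = sum(a == "1" and b == "1" for a, b in zip(site_a, site_b)) / n
    d = p_ab - p_a * p_b                             # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic")
    return d * d / denom


print(ld_r2("0011", "0011"))  # 1.0: the SNPs always travel together
print(ld_r2("0011", "0101"))  # 0.0: the SNPs are independent in this sample
```

A pair with r² close to 1 is largely redundant, so typing only one of the two SNPs loses little information.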

14 Disease Associations

15 Association studies
[Figure: samples split into Disease / Responder and Control / Non-responder groups, each marked as carrying Allele 0 or Allele 1 at Marker A.]
Marker A is associated with Phenotype

16 Association studies
Evaluate whether nucleotide polymorphisms associate with phenotype
[Figure: allele letters T A G C]
[Speaker note: Spend a lot of time on this slide]

17 Association studies
[Figure: allele letters at the SNP sites for each sample]
[Speaker note: Spend a lot of time on this slide]

18 Association studies
[Figure: binary (0/1) allele matrix]
[Speaker note: Spend a lot of time on this slide]

19 Real Haplotype Data
A region of Chr. 22, 45 Caucasian samples
[Figure: our block-free algorithm compared with two different runs of the Gabriel et al. block detection method + Zhang et al. SNP selection algorithm]
We see that this region has significant overlap in blocks.

20 A Data Compression Problem
The Minimum Informative Subset

25 Hypothesis – Haplotype Blocks?
The genome consists largely of blocks of common SNPs with relatively little recombination within the blocks.
Patil et al., Science, 2001; Jeffreys et al., Nature Genetics, 2001; Daly et al., Nature Genetics, 2001

26 Haplotype Block Structure
LD-Blocks and 4-Gamete Test Blocks
[Figure: a 200 kb region with tracks for sense genes, DNA, antisense genes, SNPs, and haplotype blocks 1 to 4.]

27 Dynamic programming framework
Partitioning a chromosome into blocks
Zhang et al., PNAS, 2002; Zhang et al., RECOMB, 2003; H. I. Avi-Itzhak et al., PSB, 2003; Sebastiani et al., PNAS, 2003; Patil et al., PNAS, 2002.
Parametric in block test
Solve a dynamic program
Optimal block partition requires the minimal number of blocks.
Within blocks one can select the SNPs that maximize entropy, diversity or r2 correlation.
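The dynamic program can be sketched as follows. This is a generic illustration, not the exact method of any of the cited papers: it takes the block test as a parameter, uses the four-gamete test as one possible test, and finds a partition into the fewest consecutive blocks. The function names and the example haplotypes are hypothetical, and the sketch assumes the block test accepts any single-site block.

```python
from itertools import combinations


def four_gamete_block(haps: list[str], start: int, end: int) -> bool:
    """True if no pair of sites in [start, end) shows all four gametes."""
    for i, j in combinations(range(start, end), 2):
        if len({(h[i], h[j]) for h in haps}) == 4:
            return False
    return True


def min_block_partition(haps: list[str], block_test=four_gamete_block):
    """Partition sites into the fewest consecutive blocks that pass block_test."""
    n_sites = len(haps[0])
    unreachable = n_sites + 1
    best = [0] + [unreachable] * n_sites   # best[e]: fewest blocks covering sites [0, e)
    prev = [0] * (n_sites + 1)             # prev[e]: start of the last block ending at e
    for end in range(1, n_sites + 1):
        for start in range(end):
            if best[start] + 1 < best[end] and block_test(haps, start, end):
                best[end] = best[start] + 1
                prev[end] = start
    if best[n_sites] == unreachable:
        raise ValueError("no valid partition under this block test")
    blocks, end = [], n_sites
    while end > 0:                         # walk back through the chosen boundaries
        blocks.append((prev[end], end))
        end = prev[end]
    return blocks[::-1]


# Hypothetical data: sites 0-1 are linked, sites 2-3 are linked,
# but all four gametes occur between the two pairs.
haps = ["0000", "0011", "1100", "1111"]
print(min_block_partition(haps))  # [(0, 2), (2, 4)]
```

Swapping in a different `block_test` (for example one based on an LD threshold) changes the block definition without touching the dynamic program.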

28 Data Compression
Haplotype blocks based on LD (method of Gabriel et al., 2002)
Selecting tagging SNPs in blocks
ACGATCGATCATGAT    A------A---TG--
GGTGATTGCATCGAT    G------G---CG--
ACGATCGGGCTTCCG    A------G---TC--
ACGATCGGCATCCCG    A------G---CC--
GGTGATTATCATGAT    G------A---TG--
Given this definition we can select one tagging SNP for the first block, two for the second and one for the third. Given the values of the tagging SNPs, the values of all the other SNPs can be predicted. Hence it suffices to type only the tagging SNPs in order to infer all the haplotype blocks. Alternative objective functions exist that capture a significant fraction of the information.
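To see why typing only the tagging SNPs suffices for this example, the sketch below uses the five haplotypes from the slide and the four positions left unmasked in the tag patterns (0-based columns 0, 7, 11 and 12). It builds a lookup from tag alleles to the full haplotype; the function names are hypothetical, and the lookup only works because the haplotypes in the sample are assumed to be the only ones that occur.

```python
HAPLOTYPES = [
    "ACGATCGATCATGAT",
    "GGTGATTGCATCGAT",
    "ACGATCGGGCTTCCG",
    "ACGATCGGCATCCCG",
    "GGTGATTATCATGAT",
]
TAG_SITES = [0, 7, 11, 12]  # columns shown unmasked in the slide's tag patterns


def tag_key(haplotype: str, sites: list[int]) -> str:
    """Alleles of a haplotype at the tagging SNPs only."""
    return "".join(haplotype[i] for i in sites)


def build_lookup(haplotypes: list[str], sites: list[int]) -> dict[str, str]:
    """Map tag alleles to the full haplotype; fail if two haplotypes collide."""
    lookup: dict[str, str] = {}
    for hap in haplotypes:
        key = tag_key(hap, sites)
        if lookup.get(key, hap) != hap:
            raise ValueError("tagging SNPs do not distinguish all haplotypes")
        lookup[key] = hap
    return lookup


lookup = build_lookup(HAPLOTYPES, TAG_SITES)
print(sorted(lookup))   # ['AATG', 'AGCC', 'AGTC', 'GATG', 'GGCG']
print(lookup["AGTC"])   # ACGATCGGGCTTCCG
```

Every haplotype has a distinct tag key, so the 4 typed SNPs determine the remaining 11 in this sample.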

29 Informativeness
[Figure: 0/1 haplotype matrix over SNPs s1, s2, s3, s4, s5; the entries did not survive extraction]
I({s1,s2}, s4) = 3/4
[Speaker note: Spend a lot of time on this slide]

30 Informativeness
[Figure: 0/1 haplotype matrix over SNPs s1, s2, s3, s4, s5; the entries did not survive extraction]
I({s3,s4},{s1,s2,s5}) = 3
S={s3,s4} is a Minimal Informative Subset
[Speaker note: Spend a lot of time on this slide]

31 Informativeness
Graph theory insight
Minimum Set Cover = Minimum Informative Subset
[Figure: graph relating the SNPs to the edges e1 to e4, drawn next to the haplotype matrix over s1 to s5]
[Speaker note: Spend a lot of time on this slide]

32 Informativeness
Graph theory insight
Minimum Set Cover {s3, s4} = Minimum Informative Subset
[Figure: graph relating the SNPs to the edges e1 to e4, drawn next to the haplotype matrix over s1 to s5]
[Speaker note: Spend a lot of time on this slide]
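The set-cover view can be made concrete with a small sketch. Several things here are assumptions, since the slide's matrix and the formal definition did not survive in the transcript: a SNP is taken to "cover" a pair of haplotypes (an edge) when the two haplotypes carry different alleles at that SNP; I(S, t) is taken to be the fraction of the edges of t covered by some SNP in S; and I(S, T) is taken to be the sum over the targets in T. The haplotype matrix below is hypothetical, constructed so that it reproduces the values quoted on the slides.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical haplotypes (rows) over SNPs s1..s5 (columns); not the slide's data.
HAPS = ["00000", "01100", "01011", "11110"]
SNPS = {f"s{k + 1}": k for k in range(5)}


def edges(snp: str) -> set[tuple[int, int]]:
    """Pairs of haplotypes that carry different alleles at this SNP."""
    col = SNPS[snp]
    pairs = combinations(range(len(HAPS)), 2)
    return {(i, j) for i, j in pairs if HAPS[i][col] != HAPS[j][col]}


def covered_by(subset) -> set[tuple[int, int]]:
    """All haplotype pairs distinguished by at least one SNP in the subset."""
    return set().union(*(edges(s) for s in subset))


def informativeness(subset, target: str) -> Fraction:
    """Fraction of the target's edges covered by the subset (assumed definition)."""
    target_edges = edges(target)
    return Fraction(len(target_edges & covered_by(subset)), len(target_edges))


def minimum_informative_subset() -> set[str]:
    """Brute-force minimum set cover: smallest subset covering every edge."""
    all_edges = covered_by(SNPS)
    for size in range(1, len(SNPS) + 1):
        for subset in combinations(SNPS, size):
            if covered_by(subset) == all_edges:
                return set(subset)
    return set(SNPS)


print(informativeness(["s1", "s2"], "s4"))                                 # 3/4
print(sum(informativeness(["s3", "s4"], t) for t in ["s1", "s2", "s5"]))   # 3
print(sorted(minimum_informative_subset()))                                # ['s3', 's4']
```

The exhaustive search is only practical for a handful of SNPs; minimum set cover is NP-hard in general, so larger instances need approximation or problem-specific structure.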

