Of Sea Urchins, Birds and Men

Of Sea Urchins, Birds and Men Algorithmic Functions of Computational Biology – Course 1 Professor Istrail.

Introduction: SNPs and Haplotypes

Genome Assembly (Ch. 5)
Hidden Markov Models (Ch. 4)
Phylogenetic Trees (Ch. 3)
Sequence Alignment (Ch. 1)

Single Nucleotide Polymorphism (SNP)
GATTTAGATCGCGATAGAG
GATTTAGATCTCGATAGAG
A SNP is a position in a genome at which two or more different bases occur in the population, each with a frequency >1%.
SNPs are the most abundant type of polymorphism.
The two alleles at the site are G and T.
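As a small illustration of the definition above, the sketch below compares the two sequences from the slide and reports where they differ. The function name `snp_positions` is hypothetical. Note that a difference between two sequences only marks a candidate site; the >1% frequency criterion needs population data, which is not modeled here.

```python
def snp_positions(seq_a, seq_b):
    """Return 0-based positions where two aligned, equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


# The two sequences shown on the slide.
first = "GATTTAGATCGCGATAGAG"
second = "GATTTAGATCTCGATAGAG"

sites = snp_positions(first, second)
alleles = [(first[i], second[i]) for i in sites]
print(sites, alleles)  # [10] [('G', 'T')]
```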

The human genome contains ~3 G base pairs arranged in 46 chromosomes.
[Slide background: an excerpt of raw human genomic sequence]
Two individuals are 99.9% the same, i.e., they differ in ~3 M base pairs.
SNPs occur once every ~600 bp.
The average gene in the human genome spans ~27 kb, which gives ~50 SNPs per gene.

Haplotypes
[Figure: the sequences of two individuals with three SNP sites marked; the alleles at the SNP sites form a haplotype, e.g. C A G]
C A G
T T G G C T C G A C A A C A G G T T C G T C A A C A G

Mutations Infinite Sites Assumption: Each site mutates at most once

Haplotype Pattern
C A G T T T G A C A T G C T G T
0 0 0 0 1 1 0 1 0 0 1 0 0 1 0 1
At each SNP site, label the two alleles as 0 and 1. The choice of which allele is 0 and which is 1 is arbitrary.
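The 0/1 labeling can be sketched as follows. The haplotypes in `haps` are hypothetical (they are not the slide's data), and the convention used here, that the allele carried by the first haplotype is labeled 0, is just one of the arbitrary choices the slide allows.

```python
def encode_binary(haplotypes):
    """Recode biallelic haplotypes as 0/1 strings.

    Convention (an arbitrary assumption): at each site, the allele carried
    by the first haplotype is labeled 0 and the other allele is labeled 1.
    """
    n_sites = len(haplotypes[0])
    for site in range(n_sites):
        if len({h[site] for h in haplotypes}) > 2:
            raise ValueError(f"site {site} has more than two alleles")
    reference = haplotypes[0]
    return [
        "".join("0" if h[site] == reference[site] else "1" for site in range(n_sites))
        for h in haplotypes
    ]


# Hypothetical haplotypes over three SNP sites.
haps = ["CAG", "TAG", "CTG", "TTC"]
encoded = encode_binary(haps)
print(encoded)  # ['000', '100', '010', '111']
```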

Recombination
G T T C G A C A A C A T
A C G T A T C T A T T A
G T T C G A C T A T T A

Recombination
G T T C G A C A A C A T
A C G T A T C T A T T A
The two alleles are linked, i.e., they are "traveling together".
Recombination disrupts the linkage:
G T T C G A C T A T T A
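A single crossover can be sketched as joining a prefix of one parental chromosome to the suffix of the other. The function name `recombine` is hypothetical, and the breakpoint (after the seventh base) is inferred from the three sequences on the slide rather than stated there.

```python
def recombine(parent_a, parent_b, breakpoint):
    """Single crossover: prefix of parent_a joined to suffix of parent_b."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    return parent_a[:breakpoint] + parent_b[breakpoint:]


# The two parental sequences from the slide.
parent_a = "GTTCGACAACAT"
parent_b = "ACGTATCTATTA"

child = recombine(parent_a, parent_b, 7)
print(child)  # GTTCGACTATTA
```

Alleles on opposite sides of the breakpoint no longer travel together in the recombinant, which is what the slide means by recombination disrupting the linkage.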

Linkage Disequilibrium (LD)
[Figure: emergence of variations over time, from a common ancestor to the present; variations in chromosomes within a population, with a disease mutation marked]

Extent of Linkage Disequilibrium
[Figure: the region around a disease-causing mutation at three time points: 2,000 generations ago, 1,000 generations ago, and the present]
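These slides do not define a measure of LD. As an illustration only, the sketch below computes two standard pairwise measures, D = p_AB - p_A p_B and r^2 = D^2 / (p_A (1 - p_A) p_B (1 - p_B)), from 0/1-encoded haplotypes. The function name `ld_stats` and the sample data are hypothetical.

```python
def ld_stats(haplotypes, i, j):
    """Return (D, r^2) between sites i and j of 0/1-encoded haplotypes."""
    n = len(haplotypes)
    p_i = sum(h[i] == "1" for h in haplotypes) / n    # frequency of allele 1 at site i
    p_j = sum(h[j] == "1" for h in haplotypes) / n    # frequency of allele 1 at site j
    p_ij = sum(h[i] == "1" and h[j] == "1" for h in haplotypes) / n
    d = p_ij - p_i * p_j
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    r2 = d * d / denom if denom > 0 else float("nan")  # undefined if a site is monomorphic
    return d, r2


# Hypothetical data: perfectly correlated sites vs. independent sites.
linked = ["00", "00", "11", "11"]
unlinked = ["00", "01", "10", "11"]
print(ld_stats(linked, 0, 1))    # (0.25, 1.0)
print(ld_stats(unlinked, 0, 1))  # (0.0, 0.0)
```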

A Data Compression Problem
Select SNPs to use in an association study
  We would like to associate single nucleotide polymorphisms (SNPs) with disease.
Very large number of candidate SNPs
  Chromosome-wide studies, whole-genome scans
  For cost-effectiveness, select only a subset.
Closely spaced SNPs are highly correlated
  A recombination between two SNPs is less likely if they are close to each other.
Notes (goal of the talk): We want to study genomic variation; the particular type of variation we are interested in is the single nucleotide polymorphism. This may be useful, for instance, when one is doing an association study, and will be implemented as a part of the Applera assays on Demand and iScience initiatives. A large number of SNPs are known to exist, and in an association study one would like to minimize cost by selecting only an informative subset of the SNPs. We consider the case where there is a large number of SNPs to use, such as in chromosome-wide or genome-wide scans. For practical reasons, such as cost-effectiveness, we would like to limit the number of SNPs that we use. Our main guide in this pursuit is the fact that closely spaced SNPs are highly correlated.

Disease Associations

Association studies
Groups: Disease / Responder and Control / Non-responder
Marker A: Allele 0 and Allele 1 (drawn as symbols on the slide)
Marker A is associated with Phenotype

Association studies
Evaluate whether nucleotide polymorphisms associate with phenotype
T A G C
T A G C
Speaker note: spend a lot of time on this slide.
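The slides do not say which statistic is used to evaluate association. One common choice, assumed here purely for illustration, is a Pearson chi-square test on a 2x2 table of allele counts in cases and controls. The counts below are made up, and `chi_square_2x2` is a hypothetical helper.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (1 degree of freedom) for the table

                 allele 0   allele 1
        cases        a          b
        controls     c          d
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column of the table is empty")
    return n * (a * d - b * c) ** 2 / denom


# Made-up allele counts for a single marker.
stat = chi_square_2x2(30, 70, 60, 40)
print(round(stat, 2))  # 18.18

# 3.841 is the 5% critical value of the chi-square distribution with 1 d.f.
print("associated" if stat > 3.841 else "no evidence of association")
```

In a genome-wide scan the threshold would need a multiple-testing correction, which this sketch leaves out.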

Association studies
C G A C G T A T A G T A C G T G A T G A
T A C G T G A T G A

Association studies
[Figure: allele labels (1s) marked on the slide]

Real Haplotype Data
A region of Chr. 22, 45 Caucasian samples
Our block-free algorithm
Two different runs of the Gabriel et al. block detection method + Zhang et al. SNP selection algorithm
We see that this region has significant overlap in blocks.

A Data Compression Problem: The Minimum Informative Subset

Hypothesis – Haplotype Blocks?
The genome consists largely of blocks of common SNPs with relatively little recombination within the blocks.
Patil et al., Science, 2001; Jeffreys et al., Nature Genetics, 2001; Daly et al., Nature Genetics, 2001

Haplotype Block Structure: LD-Blocks and 4-Gamete Test Blocks
[Figure: a 200 kb region showing sense genes, DNA, antisense genes, SNPs, and haplotype blocks 1 to 4]
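The 4-gamete test named on this slide can be sketched as follows. Under the infinite sites assumption and without recombination, two biallelic sites can show at most three of the four possible allele combinations, so observing all four is taken as evidence of a historical recombination between them. The function name and sample data are hypothetical.

```python
def four_gametes_present(haplotypes, i, j):
    """True if sites i and j show all four allele combinations (gametes)."""
    gametes = {(h[i], h[j]) for h in haplotypes}
    return len(gametes) == 4


# Hypothetical 0/1-encoded haplotypes over two sites.
compatible = ["00", "01", "10"]          # only three gametes observed
incompatible = ["00", "01", "10", "11"]  # all four gametes observed

print(four_gametes_present(compatible, 0, 1))    # False
print(four_gametes_present(incompatible, 0, 1))  # True
```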

Dynamic programming framework
Partitioning a chromosome into blocks
Zhang et al., PNAS, 2002; Zhang et al., RECOMB, 2003; H. I. Avi-Itzhak et al., PSB, 2003; Sebastiani et al., PNAS, 2003; Patil et al., PNAS, 2002
Parametric in the block test
Solve a dynamic program
The optimal block partition requires the minimal number of blocks.
Within blocks, one can select the SNPs that maximize entropy, diversity, or r² correlation.
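A minimal sketch of such a dynamic program follows; it is not the algorithm of any of the cited papers. The block test is a parameter, and the default used here (no pair of sites inside a block shows all four gametes) is an assumption chosen for illustration. `best[j]` holds the minimal number of blocks covering sites `[0, j)`. All names and data are hypothetical.

```python
from itertools import combinations


def is_block(haplotypes, start, end):
    """Assumed block test: no pair of sites in [start, end) shows four gametes."""
    for i, j in combinations(range(start, end), 2):
        if len({(h[i], h[j]) for h in haplotypes}) == 4:
            return False
    return True


def min_block_partition(haplotypes, block_test=is_block):
    """Partition sites into the fewest consecutive blocks passing block_test."""
    n = len(haplotypes[0])
    best = [0] + [float("inf")] * n  # best[j]: fewest blocks covering [0, j)
    prev = [0] * (n + 1)             # prev[j]: start of the last block ending at j
    for j in range(1, n + 1):
        for i in range(j):
            if best[i] + 1 < best[j] and block_test(haplotypes, i, j):
                best[j] = best[i] + 1
                prev[j] = i
    # Trace back the block boundaries as half-open intervals.
    blocks = []
    j = n
    while j > 0:
        blocks.append((prev[j], j))
        j = prev[j]
    return blocks[::-1]


# Hypothetical 0/1-encoded haplotypes over four sites.
haps = ["0000", "0011", "1100", "1111"]
blocks = min_block_partition(haps)
print(blocks)  # [(0, 2), (2, 4)]
```

Because a single site always passes this block test, a valid partition always exists; the naive double loop re-runs the block test for every interval and is only meant for small inputs.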

Data Compression
Selecting tagging SNPs in blocks
Haplotype blocks based on LD (method of Gabriel et al., 2002)

Haplotype          Tagging SNPs only
ACGATCGATCATGAT    A------A---TG--
GGTGATTGCATCGAT    G------G---CG--
ACGATCGGGCTTCCG    A------G---TC--
ACGATCGGCATCCCG    A------G---CC--
GGTGATTATCATGAT    G------A---TG--

Given this definition, we can select one tagging SNP for the first block, two for the second, and one for the third.
Given the values of the tagging SNPs, the values of all the other SNPs can be predicted. Hence it suffices to type only the tagging SNPs in order to infer all the haplotype blocks.
Alternative objective functions exist that capture a significant fraction of the information.
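The claim that the tagging SNPs predict all other SNPs can be checked on the slide's own five haplotypes. In the sketch below, the tag positions (0-based 0, 7, 11, 12) are read off the dashed patterns on the slide; the helper names are hypothetical. Each haplotype has a distinct combination of tag alleles, so a lookup table recovers the full haplotype from the tags alone.

```python
# The five haplotypes from the slide.
haplotypes = [
    "ACGATCGATCATGAT",
    "GGTGATTGCATCGAT",
    "ACGATCGGGCTTCCG",
    "ACGATCGGCATCCCG",
    "GGTGATTATCATGAT",
]
# 0-based positions of the tagging SNPs, read off the dashed patterns.
tag_sites = [0, 7, 11, 12]


def tag_signature(hap, sites):
    """Alleles of one haplotype at the tagging sites, as a string."""
    return "".join(hap[s] for s in sites)


def build_lookup(haps, sites):
    """Map each tag signature to its full haplotype; fail if tags are ambiguous."""
    lookup = {}
    for hap in haps:
        sig = tag_signature(hap, sites)
        if sig in lookup and lookup[sig] != hap:
            raise ValueError("tag sites do not distinguish all haplotypes")
        lookup[sig] = hap
    return lookup


lookup = build_lookup(haplotypes, tag_sites)
print(lookup["AGTC"])  # ACGATCGGGCTTCCG
```

Typing 4 of the 15 sites is enough here only because the sample contains exactly these five haplotypes; an unseen haplotype would not be in the table.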

Informativeness
I({s1,s2}, s4) = 3/4
[Figure: binary haplotype matrix over SNPs s1 s2 s3 s4 s5]

Informativeness
I({s3,s4},{s1,s2,s5}) = 3
S = {s3,s4} is a Minimal Informative Subset
[Figure: binary haplotype matrix over SNPs s1 s2 s3 s4 s5]

Informativeness
Graph theory insight: Minimum Set Cover = Minimum Informative Subset
[Figure: bipartite graph with one column of edges (e1 to e4) and one column of SNPs (s1 to s4), next to the haplotype matrix over s1 s2 s3 s4 s5]

Informativeness
Graph theory insight: Minimum Set Cover {s3, s4} = Minimum Informative Subset
[Figure: bipartite graph with one column of SNPs (s1 to s4) and one column of edges (e1 to e4), next to the haplotype matrix over s1 s2 s3 s4 s5]
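The set-cover view can be sketched as below, under an assumption the transcript does not spell out: an "edge" is taken to be a pair of haplotypes that differ at some SNP, and a SNP covers the edges it distinguishes. The greedy heuristic shown is a standard approximation for set cover and is not guaranteed to return a minimum subset. The data and function names are hypothetical.

```python
from itertools import combinations


def distinguished_pairs(haplotypes, snp):
    """Pairs of haplotypes (the assumed 'edges') that differ at the given SNP."""
    return {
        (i, j)
        for i, j in combinations(range(len(haplotypes)), 2)
        if haplotypes[i][snp] != haplotypes[j][snp]
    }


def greedy_informative_subset(haplotypes):
    """Greedy set cover: pick SNPs until every distinguishable pair is covered."""
    n_snps = len(haplotypes[0])
    covers = {s: distinguished_pairs(haplotypes, s) for s in range(n_snps)}
    uncovered = set().union(*covers.values())
    chosen = []
    while uncovered:
        # Take the SNP that distinguishes the most still-uncovered pairs.
        best = max(covers, key=lambda s: len(covers[s] & uncovered))
        chosen.append(best)
        uncovered -= covers[best]
    return sorted(chosen)


# Hypothetical 0/1-encoded haplotypes over five SNPs.
haps = ["00000", "11010", "00101", "11111"]
subset = greedy_informative_subset(haps)
print(subset)
```

For small inputs, an exhaustive search over subsets of increasing size would give a true minimum, at exponential cost.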