SNPs, Haplotypes, Disease Associations Algorithmic Foundations of Computational Biology II Course 1 Prof. Sorin Istrail.


SNPs and the Human Genome: The Minimal Informative Subset

Overview: (1) Introduction: SNPs, Haplotypes. (2) A Data Compression Problem: The Minimum Informative Subset. (3) A New Measure: Informativeness.

A Most Challenging Problem: “None of the [advances of the 20th century medicine] depend on a deep knowledge of cellular processes or on any discoveries of molecular biology. Cancer is still treated by gross physical and chemical assaults on the offending tissue. Cardiovascular disease is treated by surgery whose anatomical bases go back to the 19th century … Of course, intimate knowledge of the living cell and of basic molecular processes may be useful eventually.” Lewontin (1991)

Now: “A decade later, molecular biology can claim very few successes for drugs in clinical use that were designed ab initio to control a specific component of a pathway linked to disease: these include the monoclonal antibody Herceptin, and the kinase inhibitor Gleevec.” Reik, Gregory and Urnov (2002)

Introduction: SNPs, Haplotypes

Single Nucleotide Polymorphism (SNP): a SNP is a position in a genome at which two or more different bases occur in the population, each with a frequency >1%. It is the most abundant type of polymorphism. Example: in GATTTAGATCGCGATAGAG and GATTTAGATCTCGATAGAG, the two alleles at the site are G and T.
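The example above can be checked mechanically. The following is a minimal sketch that compares the two aligned sequences from the slide and reports where they differ; the function name `differing_sites` is an illustrative assumption, not something from the slides. Note that a mismatch between two sequences only marks a candidate site: calling it a SNP still requires the >1% population frequency from the definition.

```python
def differing_sites(seq_a, seq_b):
    """Return (0-based position, base in seq_a, base in seq_b) for every mismatch.

    Both sequences are assumed to be already aligned and of equal length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned and of equal length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


# The two sequences shown on the slide.
sites = differing_sites("GATTTAGATCGCGATAGAG", "GATTTAGATCTCGATAGAG")
print(sites)  # [(10, 'G', 'T')]
```

The single mismatch is at 0-based position 10 (the 11th base), with alleles G and T, as the slide states.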

[Figure: a long stretch of raw genomic DNA sequence.] The human genome contains ~3 G base pairs arranged in 46 chromosomes. Two individuals are 99.9% the same, i.e., they differ in ~3 M base pairs. SNPs occur once every ~600 bp. The average gene in the human genome spans ~27 kb, giving ~50 SNPs per gene.
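As a rough check of these figures (a back-of-the-envelope sketch using only the numbers quoted on the slide, with the ~50 read as a rounding of 45):

```latex
\[
3 \times 10^{9}\ \text{bp} \times (1 - 0.999) = 3 \times 10^{6}\ \text{bp},
\qquad
\frac{27\,000\ \text{bp per gene}}{600\ \text{bp per SNP}} = 45 \approx 50\ \text{SNPs per gene}.
\]
```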

SNP Haplotype. Two individuals: GCTCGACAACAG and GTTCGTCAACAG. Haplotypes (the alleles at the SNP sites only): CAG and TTG.

Mutations Infinite Sites Assumption: Each site mutates at most once

Haplotype Pattern. [Figure: haplotype letters C A G T T T G A C A T G C T G T.] At each SNP site, label the two alleles as 0 and 1. The choice of which allele is 0 and which is 1 is arbitrary.
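The 0/1 labelling can be sketched as follows. This is an illustration under stated assumptions: the function name `encode_binary` is hypothetical, the allele seen first in each column is labelled 0 (one arbitrary choice among many), and the third haplotype `CTG` is made up to go with the `CAG` and `TTG` haplotypes from the earlier slide.

```python
def encode_binary(haplotypes):
    """Encode each SNP column of equal-length haplotype strings as 0/1.

    The allele of the first haplotype is labelled 0 (an arbitrary choice);
    the other allele is labelled 1. Each column may hold at most two alleles.
    """
    n_sites = len(haplotypes[0])
    encoded = [[] for _ in haplotypes]
    for j in range(n_sites):
        column = [h[j] for h in haplotypes]
        # Distinct alleles in order of first appearance.
        alleles = sorted(set(column), key=column.index)
        if len(alleles) > 2:
            raise ValueError(f"site {j} has more than two alleles: {alleles}")
        for row, base in zip(encoded, column):
            row.append(alleles.index(base))
    return encoded


# CAG and TTG are from the slides; CTG is a hypothetical third haplotype.
print(encode_binary(["CAG", "TTG", "CTG"]))  # [[0, 0, 0], [1, 1, 0], [0, 1, 0]]
```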

Recombination. [Figure: three sequences, GTTCGACTATTA, GTTCGACAACAT and ACGTATCTATTA; the first joins the start of the second (GTTCGAC) to the end of the third (TATTA).]

Recombination: the two alleles are linked, i.e., they are “traveling together”. Recombination disrupts the linkage.

Linkage Disequilibrium (LD): variations in chromosomes within a population. [Figure: emergence of variations over time, from a common ancestor to the present, with a disease mutation marked.]

Extent of Linkage Disequilibrium. [Figure labels: 2,000 generations ago; 1,000 generations ago; present; disease-causing mutation.]

A Data Compression Problem The Minimum Informative Subset

A Data Compression Problem. Select SNPs to use in an association study: we would like to associate single nucleotide polymorphisms (SNPs) with disease. The number of candidate SNPs is very large (chromosome-wide studies, whole-genome scans), so for cost effectiveness only a subset is selected. Closely spaced SNPs are highly correlated: a recombination between two SNPs is less likely if they are close to each other.
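One common way to quantify how correlated two SNPs are is the squared correlation coefficient r² between their 0/1 allele vectors. The slides do not name a specific measure, so r² is an assumption here, and the function name `r_squared` and the toy vectors are hypothetical.

```python
def r_squared(snp_a, snp_b):
    """Squared correlation (r^2) between two 0/1 allele vectors.

    Both vectors list the allele carried by the same haplotypes, in the same order.
    """
    n = len(snp_a)
    if n == 0 or n != len(snp_b):
        raise ValueError("vectors must be non-empty and of equal length")
    p_a = sum(snp_a) / n                                   # frequency of allele 1 at SNP a
    p_b = sum(snp_b) / n                                   # frequency of allele 1 at SNP b
    p_ab = sum(a & b for a, b in zip(snp_a, snp_b)) / n    # frequency of the 1-1 haplotype
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")
    d = p_ab - p_a * p_b                                   # disequilibrium coefficient D
    return d * d / denom


# Hypothetical toy data: identical SNPs are perfectly correlated,
# independent ones are not correlated at all.
print(r_squared([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0
print(r_squared([0, 0, 1, 1], [0, 1, 0, 1]))  # 0.0
```

A value near 1 means that typing one SNP tells you almost everything about the other, which is what makes selecting a subset possible.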

Disease Associations

Association studies. [Figure: individuals in two groups (Disease or Responder; Control or Non-responder), with a legend for Marker A: Allele 0, Allele 1.] Marker A is associated with the phenotype.

Association studies: evaluate whether nucleotide polymorphisms associate with phenotype. [Figure, repeated on the following slide: allele labels TA GA A CG GA A CG TA A TA TC G TG TA G TG GA G.]
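The slides do not say which statistical test is used to evaluate association. One common choice, shown here purely as an illustration, is Pearson's chi-square test on a 2x2 table of allele counts in cases versus controls. The function name `chi_square_2x2` and all counts below are hypothetical.

```python
def chi_square_2x2(case_0, case_1, control_0, control_1):
    """Pearson chi-square statistic (1 degree of freedom) for a 2x2 allele-count table.

    Rows are cases and controls; columns are allele 0 and allele 1.
    """
    table = [[case_0, case_1], [control_0, control_1]]
    row_sums = [sum(row) for row in table]
    col_sums = [case_0 + control_0, case_1 + control_1]
    total = sum(row_sums)
    if 0 in row_sums or 0 in col_sums:
        raise ValueError("every row and column needs at least one observation")
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total   # count expected under no association
            stat += (table[i][j] - expected) ** 2 / expected
    return stat


# Hypothetical counts: allele 0 is enriched in cases.
stat = chi_square_2x2(case_0=30, case_1=10, control_0=10, control_1=30)
print(stat)           # 20.0
print(stat > 3.841)   # True: exceeds the 5% critical value for 1 degree of freedom
```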

Compression based on Haplotype Resolution

D-graph of a SNP: with a SNP s we associate a bipartite graph. Nodes: the set of haplotypes. Edges: the set of pairs of haplotypes with different alleles at s. [Figure: example D-graphs for SNPs s1 and s2.]

D-graph of a set of SNPs: with a set of SNPs S we associate a graph. Nodes: the set of haplotypes. Edges: the set of pairs of haplotypes with different alleles at some SNP s in S. [Figure: example D-graph for the set {s1, s2}.]
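The D-graph definition translates directly into code. The sketch below assumes haplotypes are given as equal-length strings over 0/1 and SNPs as 0-based column indices; the function name `d_graph_edges` and the four toy haplotypes are hypothetical.

```python
from itertools import combinations


def d_graph_edges(haplotypes, snps):
    """Edges of the D-graph for a set of SNP column indices.

    Nodes are haplotype row indices. An edge (i, j), i < j, joins two
    haplotypes that carry different alleles at some SNP in `snps`.
    """
    edges = set()
    for i, j in combinations(range(len(haplotypes)), 2):
        if any(haplotypes[i][s] != haplotypes[j][s] for s in snps):
            edges.add((i, j))
    return edges


# Hypothetical toy data: 4 haplotypes (rows) typed at 3 SNPs (columns).
H = ["000", "010", "101", "111"]
print(sorted(d_graph_edges(H, [0])))      # [(0, 2), (0, 3), (1, 2), (1, 3)]
print(sorted(d_graph_edges(H, [1])))      # [(0, 1), (0, 3), (1, 2), (2, 3)]
print(len(d_graph_edges(H, [0, 1])))      # 6: together they separate every pair
```

In this toy data SNPs 0 and 2 have identical D-graphs, so typing either one distinguishes exactly the same pairs of haplotypes.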

SNP Selection: the red SNP is equivalent to the blue SNP.

SNP Selection: the red SNPs predict the green SNP.

Data Compression: Minimal Informative Subset

Compression based on Haplotype Blocks

Hypothesis: Haplotype Blocks? The genome consists largely of blocks of common SNPs with relatively little recombination within the blocks. References: Patil et al., Science, 2001; Jeffreys et al., Nature Genetics, 2001; Daly et al., Nature Genetics, 2001.

Haplotype Block Structure: LD-Blocks and 4-Gamete Test Blocks. [Figure: a 200 kb stretch of DNA with tracks for sense genes, antisense genes, SNPs, and haplotype blocks; labels 1 2 3 4.]

Four Gamete Block Test (Hudson and Kaplan 1985): a segment of SNPs is a block if, between every pair of SNPs, at most 3 out of the 4 gametes (00, 01, 10, 11) are observed. [Figure: two examples, one labelled BLOCK and one labelled VIOLATES THE BLOCK DEFINITION.]
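The block test can be sketched in a few lines. The code assumes haplotypes are equal-length strings over the two allele labels 0 and 1; the function names and the toy haplotypes are hypothetical.

```python
from itertools import combinations


def violates_four_gamete(haplotypes, i, j):
    """True if all four gametes 00, 01, 10, 11 are observed at SNP columns i and j."""
    return len({(h[i], h[j]) for h in haplotypes}) == 4


def is_block(haplotypes, start, end):
    """True if every pair of SNPs in columns start..end-1 shows at most 3 gametes."""
    return not any(
        violates_four_gamete(haplotypes, i, j)
        for i, j in combinations(range(start, end), 2)
    )


# Hypothetical toy data: every pair of columns shows only 3 of the 4 gametes.
block = ["000", "010", "110", "111"]
print(is_block(block, 0, 3))             # True

# Adding the haplotype 100 completes the set 00, 01, 11, 10 at columns 0 and 1.
print(is_block(block + ["100"], 0, 3))   # False
```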

Finding Recombination Hotspots: Many Possible Partitions into Blocks. [Figure: aligned haplotype sequences with pairs of sites marked where all four gametes are present.]

[Figure: the same aligned haplotype sequences.] Find the left-most right endpoint of any constraint and mark the site before it as a recombination site. Eliminate any constraints crossing that site. Repeat until all constraints are gone. The final result is a minimum-size set of sites crossing all constraints.
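The greedy procedure above can be sketched as follows. The representation is an assumption made for illustration: each constraint is a pair `(left, right)` of SNP indices that fails the four-gamete test, and a boundary `b` means "cut between SNP `b-1` and SNP `b`", so it crosses a constraint when `left < b <= right`. The function name and the toy constraints are hypothetical.

```python
def min_recombination_sites(constraints):
    """Greedily pick a minimum set of boundaries crossing every constraint.

    Each constraint (left, right), left < right, says a recombination must
    lie somewhere between SNP `left` and SNP `right`.
    """
    boundaries = []
    remaining = list(constraints)
    while remaining:
        # Left-most right endpoint among the constraints still uncrossed.
        b = min(right for _, right in remaining)
        boundaries.append(b)
        # Drop every constraint that the new boundary crosses.
        remaining = [(l, r) for l, r in remaining if not (l < b <= r)]
    return boundaries


# Hypothetical constraints over 7 SNPs (indices 0..6).
print(min_recombination_sites([(0, 3), (1, 2), (2, 5), (4, 6)]))  # [2, 5]
```

Here the cut before SNP 2 crosses (0, 3) and (1, 2), and the cut before SNP 5 crosses (2, 5) and (4, 6), giving three blocks: SNPs 0 to 1, 2 to 4, and 5 to 6.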

Data Compression: Selecting Tagging SNPs in blocks. Haplotype blocks based on LD (method of Gabriel et al. 2002). Full haplotypes and the same haplotypes reduced to their tagging SNPs:
ACGATCGATCATGAT    A------A---TG--
GGTGATTGCATCGAT    G------G---CG--
ACGATCGGGCTTCCG    A------G---TC--
ACGATCGGCATCCCG    A------G---CC--
GGTGATTATCATGAT    G------A---TG--
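The reduction shown on this slide can be reproduced with a small sketch. The tag positions (0-based columns 0, 7, 11 and 12) are read off the dashed rows of the slide; how those positions were chosen is not shown here, and the function name `mask_to_tags` is hypothetical.

```python
def mask_to_tags(haplotype, tag_positions):
    """Keep the bases at the tagging SNP positions and replace the rest with '-'."""
    keep = set(tag_positions)
    return "".join(base if i in keep else "-" for i, base in enumerate(haplotype))


# The five haplotypes from the slide.
haplotypes = [
    "ACGATCGATCATGAT",
    "GGTGATTGCATCGAT",
    "ACGATCGGGCTTCCG",
    "ACGATCGGCATCCCG",
    "GGTGATTATCATGAT",
]
tags = (0, 7, 11, 12)  # 0-based columns, read off the slide's dashed rows

reduced = [mask_to_tags(h, tags) for h in haplotypes]
for full, short in zip(haplotypes, reduced):
    print(full, short)

# The four tagging SNPs still tell all five haplotypes apart.
print(len(set(reduced)) == len(haplotypes))  # True
```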

A New Measure: Informativeness

[Figure: a SNP s and two haplotypes h1 and h2.]

Informativeness: $I(s_1, s_2) = 2/4 = 1/2$. [Figure: SNPs $s_1$ to $s_5$.]

Informativeness: $I(\{s_1, s_2\}, s_4) = 3/4$. [Figure: SNPs $s_1$ to $s_5$.]

Informativeness: $I(\{s_3, s_4\}, \{s_1, s_2, s_5\}) = 3$, so $S = \{s_3, s_4\}$ is a Minimal Informative Subset. [Figure: SNPs $s_1$ to $s_5$.]
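The transcript gives values of informativeness but not its formula. The sketch below therefore rests on an assumed definition: the informativeness of a set S about a target SNP t is the fraction of edges in the D-graph of t that are also edges in the D-graph of S, and the informativeness about a set of targets is the sum over the targets (which would give 3 when three targets are each fully predicted). The function names and the toy haplotypes are hypothetical; the toy data is not the data from the slides, although it was chosen to produce the same values 1/2 and 3/4.

```python
from fractions import Fraction
from itertools import combinations


def d_graph_edges(haplotypes, snps):
    """Pairs of haplotype indices that differ at some SNP column in `snps`."""
    return {
        (i, j)
        for i, j in combinations(range(len(haplotypes)), 2)
        if any(haplotypes[i][s] != haplotypes[j][s] for s in snps)
    }


def informativeness(haplotypes, source_snps, target_snps):
    """Assumed definition: sum over targets t of |E(S) & E(t)| / |E(t)|."""
    covered = d_graph_edges(haplotypes, source_snps)
    total = Fraction(0)
    for t in target_snps:
        target_edges = d_graph_edges(haplotypes, [t])
        if not target_edges:
            raise ValueError(f"SNP {t} is not polymorphic in the sample")
        total += Fraction(len(target_edges & covered), len(target_edges))
    return total


# Hypothetical toy data: 4 haplotypes (rows) typed at 3 SNPs (columns 0, 1, 2).
H = ["000", "101", "111", "110"]
print(informativeness(H, [0], [1]))      # 1/2
print(informativeness(H, [0, 1], [2]))   # 3/4
```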

Informativeness, graph theory insight: Minimum Set Cover = Minimum Informative Subset. [Figure: a bipartite graph linking the SNPs s1 to s5 to the D-graph edges e1 to e6 that each SNP covers.]

Informativeness, graph theory insight: the Minimum Set Cover $\{s_3, s_4\}$ is the Minimum Informative Subset. [Figure: the same bipartite graph of SNPs s1 to s5 and edges e1 to e6, with the cover highlighted.]
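The set-cover view can be made concrete: each SNP "covers" the haplotype pairs it distinguishes, and a minimum informative subset is a smallest set of SNPs covering every pair that any SNP distinguishes. The sketch below finds it by exhaustive search, which is exact but exponential in the number of SNPs and so only suitable for tiny examples; it is not presented as the authors' algorithm. The function name and toy haplotypes are hypothetical.

```python
from itertools import combinations


def minimum_informative_subset(haplotypes):
    """Smallest set of SNP columns whose D-graph covers every distinguishable pair.

    Exhaustive search over subsets in order of increasing size.
    """
    n_snps = len(haplotypes[0])
    pairs = list(combinations(range(len(haplotypes)), 2))
    # covers[s] = set of haplotype pairs that SNP s tells apart.
    covers = [
        {(i, j) for i, j in pairs if haplotypes[i][s] != haplotypes[j][s]}
        for s in range(n_snps)
    ]
    universe = set().union(*covers)
    for size in range(n_snps + 1):
        for subset in combinations(range(n_snps), size):
            if set().union(*(covers[s] for s in subset)) == universe:
                return subset
    return tuple(range(n_snps))  # unreachable: the full set always covers


# Hypothetical toy data: 4 haplotypes typed at 4 SNPs; column 2 duplicates column 0.
H = ["0000", "0101", "1011", "1111"]
print(minimum_informative_subset(H))  # (0, 1)
```

No single SNP separates all six haplotype pairs here, but SNPs 0 and 1 together do, so typing those two is enough to recover the other two.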

Real Haplotype Data. [Figure legend: two different runs of the Gabriel et al. block detection method + Zhang et al. SNP selection algorithm; our block-free algorithm. Data: a region of Chr Caucasian samples.]

When Maximum Likelihood = Bayesian = Parsimony