Association mapping: Fundamental Principles and a few methods. Thomas Mailund

Outline ● Introduction ➔ Goals and setup ➔ Genetic variation in humans ● Marker/disease association ➔ Indirect association and linkage disequilibrium ➔ Background population genetics concepts ➔ “Global” and “local” genealogies ● Mapping methods ➔ Local genealogies (Blossoc) ➔ Clustering (HapCluster)

What are we looking for? A spectrum of causes from Environment to Genes: Gunshot wounds, Car accidents, Smoking-induced lung cancer, Cardiovascular disease, Obesity, Diabetes 2, Alzheimer, Schizophrenia, BRCA1 breast cancer, Cystic fibrosis, Hemophilia

Goal of association mapping Identification of a susceptibility variant, replication in a different cohort/population, and understanding of genetic function at the cell level can lead to 1. identification of druggable targets, 2. development of drugs for prevention, 3. better understanding of the cellular processes that are involved in disease, and from there to treatments. Mapping function = P(Disease gene location | data)

Case-control studies [Figure: counts of Allele 1 and Allele 2 of Marker A in the Disease/Responder group and in the Control/Non-responder group; Marker A is associated with the phenotype.] Cautions ● Subgroup analysis and multiple testing ● Poorly defined phenotypes ● Poorly matched controls, population stratification ● Failure to replicate ● Optimistic interpretation of results ● Positive publication bias ● Measuring the wrong variation

Relative risk ● Relative risk (RR) ● RR is the probability of disease in the exposed group (carriers of the susceptibility allele or genotype) divided by the probability of disease in the unexposed group (non-carriers) ● E.g. RR = 1.5 indicates that carriers of the A allele have 1.5 times the disease risk of non-carriers, i.e. they are 50% more likely to get the disease ● Genotype relative risk (GRR) ● Relative risk assigned to genotypes AA and Aa, relative to aa ● GRR(Aa) = P(diseased | Aa) / P(diseased | aa) ● GRR(AA) = P(diseased | AA) / P(diseased | aa) ● E.g. additive model: ● P(diseased | aa) = b; P(diseased | Aa) = b + e; P(diseased | AA) = b + 2e ● GRR(Aa) = (b+e)/b; GRR(AA) = (b+2e)/b = 2 GRR(Aa) - 1 ● If GRR(Aa) = 1.5, then GRR(AA) = 2
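The additive-model arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name `additive_grr` and the numbers b = 0.02, e = 0.01 are illustrative assumptions, not values from the slides.

```python
def additive_grr(baseline: float, effect: float) -> dict:
    """Genotype relative risks under the additive model.

    baseline is b = P(diseased | aa); effect is e, the risk added per A allele.
    """
    risk = {"aa": baseline, "Aa": baseline + effect, "AA": baseline + 2 * effect}
    # Every GRR is expressed relative to the aa genotype.
    return {genotype: r / risk["aa"] for genotype, r in risk.items()}


# Illustrative (made-up) numbers: b = 0.02, e = 0.01.
grr = additive_grr(baseline=0.02, effect=0.01)
print(grr)
```

With these numbers GRR(Aa) comes out as 1.5 and GRR(AA) as 2, matching the identity GRR(AA) = 2 GRR(Aa) - 1.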

Relative risk: Examples ● Relative risk for relatives of an affected individual (any genetic effect); examples from Lon Cardon: ● Huntington’s Disease >1000 ● Cystic Fibrosis 400 ● Autism 75 ● Inflammatory Bowel Disease 60 ● Multiple Sclerosis 20 ● Juvenile Diabetes 15 ● Schizophrenia 10 ● Asthma 6 ● Prostate Cancer 5 ● Late Onset Diabetes 2-3 ● Breast Cancer 2 ● Genes with relative risk for schizophrenia ➔ Neuregulin (NRG1) GRR: 2 ➔ Calcineurin (PPP3CC) GRR: 1.3 ➔ Catechol-O-methyl transferase (COMT) GRR: 1.5

Genetic variation ● Very little variation in humans (compared to related species)

Genetic variation ● Each new cell contains ~3 new mutations ● Each new “child” has ~20 new mutations ● On average the sequences of any two human genomes are 99.9% the same ➔ 0.1% of the genome ~ 3 million base pairs ➔ Maybe as many as 2.5 billion sites vary somewhere in the entire population ● This genetic variation (plus environmental influences) is responsible for variation in human traits.

Types of variation [Figure from Annu. Rev. Genom. Human Genet.]

Single Nucleotide Polymorphisms ● The most common form of genetic polymorphism ● Common variants (MAF>5%) estimated to occur every bp (10 – 30 million SNPs)

HapMap ● Phase I: 1 million SNPs in 90 individuals from Europe, Africa and Asia ● Phase II: 3 million SNPs ➔ SNP selection made easy ➔ 250K & 500K SNP sets based on non-redundant Phase I SNPs commercially available ➔ Genome-wide scans a reality

Setup: case/control sequences
Cases (affected) and controls (unaffected): sequences of nucleotides at known polymorphic sites
--A C A----G T---C---A--
--T G A----G C---C---A--
--A G G----G C---C---A--
--A C A----G T---C---A--
--T C A----G T---C---A--
--T C A----T T---A---A--
--A C A----G T---C---A--
--A C A----G T---C---G--
--T C A----T T---C---A--
--A C A----G T---C---A--
--A C G----T C---A---A--
--A C A----G C---C---G--

Actual setup: unphased sequences
Sequences of pairs of nucleotides at known polymorphic sites:
-A/T------C/G------A/A--G/G------T/C-C/C-A/A---
-A/T------C/G------G/A--G/G------T/C-C/C-A/A---
-T/T------C/C------A/A--G/T------T/T-C/A-A/A---
-A/A------C/C------A/A--G/G------T/T-C/C-A/A---
-A/T------C/C------A/A--G/T------T/T-C/C-G/A---
-A/A------C/C------G/A--G/T------T/C-C/A-A/A---
Phase inference software: PHASE, SNPHAP; but see also Morris et al. 2004

Association mapping [Figure: the case/control haplotype panel.] We are searching for an association between variant and disease status: is there a significant difference in distributions?

Significant difference in distribution [Figure: the case/control haplotype panel.] Consider a single marker...

Significant difference in distribution Contingency table of allele (G, T) by disease status (Affected, Unaffected), with cells: G, Affected; T, Affected; G, Unaffected; T, Unaffected

Significant difference in distribution Conditional disease status: Affected | G and Unaffected | G versus Affected | T and Unaffected | T

Significant difference in distribution If the marker does not affect the disease: P( Affected|G ) = P( Affected|T )

Significant difference in distribution If the marker does affect the disease: P( Affected|G ) ≠ P( Affected|T ), e.g. P( Affected|G ) > P( Affected|T )

Significant difference in distribution Null-hypothesis: the marker does not affect the disease status, so each cell probability is the product of its margins P( G ), P( T ) and P( Affected ), P( Unaffected ): P(T, Unaffected) = P(T)P(Unaffected); P(G, Unaffected) = P(G)P(Unaffected); P(T, Affected) = P(T)P(Affected); P(G, Affected) = P(G)P(Affected)

Significant difference in distribution ● The null-hypothesis is tested with ➔ Fisher’s exact test (for small data sets) ➔ χ² test (large sample approximation) when each cell has an expected count of at least 5 ➔ Allelic level: 2x2 matrix ➔ Genotype level: 2x3 matrix ➔ For two loci, there are 9 different two-locus genotypes, i.e. interactions can be tested in a 2x9 matrix
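As a rough illustration of the allelic 2x2 test, the sketch below computes the Pearson χ² statistic with its 1-degree-of-freedom p-value, and a two-sided Fisher exact p-value, using only the Python standard library. The function names and the allele counts are hypothetical, and the two-sided Fisher rule used here (sum all tables no more probable than the observed one) is one common convention among several.

```python
from math import comb, erfc, sqrt


def chi2_2x2(a: int, b: int, c: int, d: int):
    """Pearson chi-square statistic and p-value (1 df) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    return stat, erfc(sqrt(stat / 2))


def fisher_exact_2x2(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value with all margins held fixed."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x: int) -> float:
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# Hypothetical allele counts: rows = cases, controls; columns = allele G, allele T.
stat, p_chi2 = chi2_2x2(30, 10, 20, 20)
p_fisher = fisher_exact_2x2(30, 10, 20, 20)
print(stat, p_chi2, p_fisher)
```

The same `chi2_2x2` formula does not extend to the 2x3 or 2x9 tables; those need the general sum over (observed - expected)² / expected with more degrees of freedom.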

Relative risk and power Statistical power: the probability of rejecting the null hypothesis when it is in fact false. Simulations by M. Schierup: 1000 simulations with an additive disease model; P(A) = 0.1; 1000 cases and 1000 controls; 5% significance level (0.005% with Bonferroni correction)
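A power estimate of this kind can be sketched by simulation. The code below is not the simulation behind the slide; it is a minimal stand-in that assumes Hardy-Weinberg proportions in the population, an additive model with a made-up baseline risk of 0.05, and an allelic χ² test. All function and parameter names are illustrative.

```python
import random
from math import erfc, sqrt


def genotype_dist(p_allele, baseline, grr_het, affected):
    """Frequencies of genotypes aa, Aa, AA among cases (affected) or controls."""
    q = 1 - p_allele
    freq = [q * q, 2 * p_allele * q, p_allele * p_allele]  # Hardy-Weinberg (assumed)
    e = baseline * (grr_het - 1)                            # additive effect per A allele
    risk = [baseline + k * e for k in range(3)]
    weights = [f * (r if affected else 1 - r) for f, r in zip(freq, risk)]
    total = sum(weights)
    return [w / total for w in weights]


def allele_count(dist, n, rng):
    """Number of A alleles carried by n individuals drawn from dist."""
    return sum(rng.choices([0, 1, 2], weights=dist, k=n))


def power(p_allele, grr_het, n=1000, sims=200, alpha=0.05, baseline=0.05, seed=1):
    """Fraction of simulated studies in which the allelic test rejects at level alpha."""
    rng = random.Random(seed)
    case_dist = genotype_dist(p_allele, baseline, grr_het, True)
    ctrl_dist = genotype_dist(p_allele, baseline, grr_het, False)
    hits = 0
    for _ in range(sims):
        a = allele_count(case_dist, n, rng)   # A alleles in cases
        c = allele_count(ctrl_dist, n, rng)   # A alleles in controls
        b, d = 2 * n - a, 2 * n - c           # a alleles in cases, controls
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        if denom == 0:
            continue
        stat = (a + b + c + d) * (a * d - b * c) ** 2 / denom
        if erfc(sqrt(stat / 2)) < alpha:
            hits += 1
    return hits / sims


for grr_het in (1.0, 1.2, 1.5):
    # Bonferroni for 1000 tests: 0.05 / 1000 = 0.00005, i.e. 0.005%.
    print(grr_het, power(0.1, grr_het), power(0.1, grr_het, alpha=0.05 / 1000))
```

With GRR(Aa) = 1 the rejection rate should sit near the significance level, and it rises towards 1 as the relative risk grows.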

The Central Dogma: the common disease / common variant (CD/CV) hypothesis (Reich & Lander 2001) [Figure: past to present; population expansion < years ago; rare variant; common variant.] In a small population, allelic heterogeneity is small. < years ago the human population was very small. Even though the human population today is large, the frequency spectrum of variants still reflects the recent small size/bottleneck, so common diseases are caused by a few common variants (and a lot of rare, undetectable variants caused by recent mutations). If association studies locate many susceptibility variants, the hypothesis is supported

Frequency and power Simulations by M. Schierup 1000 simulations with additive disease model 1000 cases and 1000 controls, P(disease | aa) = % significance level (0.005% with Bonferroni correction)

Example: Cystic fibrosis χ²-test for different distributions, Kerem et al. (1989). Control group: 92 SNP haplotypes; case group: 94 SNP haplotypes; 23 SNP markers

An indirect approach
--A C A----G---X----T---C---A--
--T G A----G---X----C---C---A--
--A G G----G---X----C---C---A--
--A C A----G---X----T---C---A--
--T C A----G---X----T---C---A--
--T C A----T---X----T---A---A--
--A C A----G---X----T---C---A--
--A C A----G---X----T---C---G--
--T C A----T---X----T---C---A--
--A C A----G---X----T---C---A--
--A C G----T---X----C---A---A--
--A C A----G---X----C---C---G--
● Disease site (X) unlikely to be among our markers ➔ Might be an unknown polymorphic site (and not necessarily a SNP) ➔ Or just not part of the typed markers (maybe 500K typed out of 3 billion nucleotides!)

An indirect approach [Figure: the haplotype panel with the untyped disease site X.] ● The markers are not independent ➔ Knowing one marker is partial knowledge of others ➔ The non-independence is called LD: Linkage Disequilibrium

Genealogical view of LD [Figure: variations in chromosomes within a population; common ancestor; emergence of variations over time, up to the present; disease mutation.]

Linkage disequilibrium [Figure: variations in chromosomes within a population.] With A and B denoting the alleles at the two markers in each example: ● P(A) = P(B) = 0.42; P(AB) = 0.42; P(A)P(B) = 0.18; D(AB) = P(AB) - P(A)P(B) = 0.24 ● P(A) = 0.17; P(B) = 0.29; P(AB) = 0.17; P(A)P(B) = 0.05; D(AB) = P(AB) - P(A)P(B) = 0.12

Measures of LD ● D(AB) = P(AB) – P(A)P(B) = D(ab) = -D(Ab) = -D(aB) ● D’ (Lewontin 1964): D rescaled by its range, which is constrained by the allele frequencies; takes values in [0,1]: D’(AB) = D(AB) / min(P(A)P(b), P(a)P(B)) if D(AB) > 0, else -D(AB) / min(P(A)P(B), P(a)P(b)) ● r² (Hill & Robertson 1968): correlation coefficient measure; takes values in [0,1]: r²(AB) = D(AB)² / (P(A)P(a)P(B)P(b))

Linkage disequilibrium First example, with A and B the alleles at the two markers: P(A) = P(B) = 0.42; P(AB) = 0.42; P(A)P(B) = 0.18; D(AB) = P(AB) - P(A)P(B) = 0.24; D’(AB) = D(AB) / min{P(A)(1-P(B)), (1-P(A))P(B)} = 0.24 / min{0.42x0.58, 0.58x0.42} = 1; r²(AB) = D(AB)² / (P(A)(1-P(A))P(B)(1-P(B))) = 0.06 / (0.42x0.58x0.42x0.58) = 1

Linkage disequilibrium Second example: P(A) = 0.17; P(B) = 0.29; P(AB) = 0.17; P(A)P(B) = 0.05; D(AB) = P(AB) - P(A)P(B) = 0.12; D’(AB) = D(AB) / min{P(A)(1-P(B)), (1-P(A))P(B)} = 0.12 / min{0.17x0.71, 0.83x0.29} = 1; r²(AB) = D(AB)² / (P(A)(1-P(A))P(B)(1-P(B))) = 0.015 / (0.17x0.83x0.29x0.71) = 0.50
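The two worked examples can be reproduced with a small helper. This is a sketch with an assumed function name (`ld_measures`); it takes the haplotype frequency P(AB) and the two allele frequencies as input and returns D, D’ and r² without rounding the intermediate values.

```python
def ld_measures(p_ab: float, p_a: float, p_b: float):
    """Return (D, D', r^2) for alleles A and B.

    p_ab is the frequency of haplotypes carrying both A and B;
    p_a and p_b are the allele frequencies at the two markers.
    """
    d = p_ab - p_a * p_b
    if d > 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2


print(ld_measures(0.42, 0.42, 0.42))  # first example: D' and r^2 are both 1
print(ld_measures(0.17, 0.17, 0.29))  # second example: D' is 1, r^2 is about 0.50
```

The second example shows why the two measures are reported together: D’ reaches 1 whenever one of the four haplotypes is absent, while r² reaches 1 only when the two markers also have matching allele frequencies.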

Causes of LD (from time t ago to now) ● Creates LD: Drift, Selection, Admixture ● Breaks down LD: Recombination, (Gene conversion)

An indirect approach [Figure: the haplotype panel with the untyped disease site X.] ● The markers are not independent ➔ Knowing one marker is partial knowledge of others ➔ This non-independence decreases with distance

A short detour: population genetics [Figure: diploid model of reproduction (without recombination), showing Parents, Gametes and Offspring; chromosome reproduction (without recombination).]

Wright-Fisher model ● Discrete, non-overlapping generations ● Constant population size ● Each individual in one generation is ➔ a random copy of an individual from the previous generation ➔ or a new mutation Mutation
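A minimal sketch of this model follows, tracking one allele label per individual. The function name, the parameter `mu` (probability that an individual is a new mutation rather than a copy) and the integer allele labels are illustrative assumptions.

```python
import random


def wright_fisher(pop_size: int, generations: int, mu: float, seed: int = 0):
    """Simulate discrete, non-overlapping generations at constant population size."""
    rng = random.Random(seed)
    population = [0] * pop_size      # everyone starts with allele 0
    next_allele = 1                  # label for the next new mutation
    for _ in range(generations):
        new_population = []
        for _ in range(pop_size):
            if rng.random() < mu:
                # A new mutation: an allele never seen before.
                new_population.append(next_allele)
                next_allele += 1
            else:
                # A random copy of an individual from the previous generation.
                new_population.append(rng.choice(population))
        population = new_population
    return population


final = wright_fisher(pop_size=50, generations=100, mu=0.01)
print(sorted(set(final)))  # alleles still present after drift and mutation
```

Running it with `mu=0` shows pure drift: no new alleles appear, and a population that starts with several alleles eventually keeps only one.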

Recombinations

Non-Ancestral Material Crossover point

Wright-Fisher with recombination ● Discrete, non-overlapping generations ● Constant population size ● Each individual in one generation is ➔ a random copy of an individual from the previous generation ➔ a new mutation ➔ a recombination of two individuals from the previous generation, at a random cross-over point Recombination
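The model with recombination can be sketched the same way, now with haplotypes as lists of 0/1 alleles over several sites. The names `rho` (probability that an individual is a recombinant) and `mu` (probability that it is a mutated copy), and the choice to make mutation flip a single random site, are assumptions made for this sketch; haplotypes need at least two sites.

```python
import random


def offspring(population, mu, rho, rng):
    """Draw one individual of the next generation."""
    n_sites = len(population[0])
    u = rng.random()
    if u < rho:
        # Recombination of two individuals at a random cross-over point.
        left, right = rng.choice(population), rng.choice(population)
        cut = rng.randrange(1, n_sites)
        return left[:cut] + right[cut:]
    child = list(rng.choice(population))   # random copy of one individual
    if u < rho + mu:
        # New mutation: flip the allele at one random site.
        site = rng.randrange(n_sites)
        child[site] = 1 - child[site]
    return child


def simulate(population, generations, mu, rho, seed=0):
    """Constant-size, non-overlapping generations with mutation and recombination."""
    rng = random.Random(seed)
    for _ in range(generations):
        population = [offspring(population, mu, rho, rng) for _ in population]
    return population


start = [[0, 0, 0, 0]] * 5 + [[1, 1, 1, 1]] * 5
result = simulate(start, generations=20, mu=0.01, rho=0.2)
print(result)
```

Starting from two distinct haplotypes, recombinant individuals such as `[0, 0, 1, 1]` carry material from both, which is how association between distant markers gets broken down over generations.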

Mutation + Recombination Mutation Point

Indirect association [Figure: markers at increasing distance from the mutation point show complete association, less association, and even less association.]

An indirect approach χ²-test for different distributions: markers are highly associated because they are close to the disease affecting site

An indirect approach Prostate cancer in Iceland, BRCA2 gene: linkage disequilibrium measured by r² using Haploview 3.12. These associations are NOT independent, i.e. they probably mark the same variant

Extent of relatedness ● “Nearby” is ~ 0.1–0.01 cM ➔ ~ 100–10 Kbp ➔ ~ 1/30,000 – 1/300,000 of the genome ● Closer spacing needed for accuracy ➔ ~ 500,000–1,000,000 markers for whole genome ➔ ~ 10–100 markers for typical gene

LD as a function of distance ● Empirical results from HapMap data (from: Clark et al. 2003, AJHG 73:) ● Simulation results (from: Hein et al); axes: LD (r²), recombination rate

Extent of relatedness ● Population dependent ➔ Founding age ➔ Isolation (inbreeding) ● From populations where LD extends over longer distances (“low” density marker map, low resolution) to populations where it extends over shorter distances (“high” density marker map, high resolution): ➔ Isolated, recently founded: Quebec, Cajun Acadiana, Utah, Amish, Iceland, Faroe Islands ➔ Isolated, relatively old: Kainuu (Finland), North Karelia (Finland), Sardinia, Ashkenazi Jews ➔ Non-isolated, relatively old (bottlenecks): European, Asian ➔ Africa

Variation in recombination rate Sperm analysis Population genetic data Myers et al McVean et al. 2005

Tagging SNPs ● Close markers are in linkage disequilibrium, i.e. one marker carries information on nearby variation ● LD between SNPs is often so high that typing the whole set provides little more information than typing a few tagSNPs ● tagSNPs: a minimal set of informative markers that can be used to identify the common haplotypes in each block
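One simple way to pick tagSNPs is a greedy search on pairwise r². The slides do not name a selection algorithm, so the sketch below is an illustration under assumptions: haplotypes are lists of 0/1 alleles, a SNP counts as tagged when its r² with a chosen tag reaches a threshold (0.8 here, a commonly used but arbitrary value), and all names are hypothetical.

```python
def r2(x, y):
    """r^2 between two SNP columns coded as 0/1 across haplotypes."""
    n = len(x)
    pa, pb = sum(x) / n, sum(y) / n
    if pa in (0, 1) or pb in (0, 1):
        return 0.0                      # a monomorphic site carries no LD information
    pab = sum(1 for a, b in zip(x, y) if a == 1 and b == 1) / n
    d = pab - pa * pb
    return d * d / (pa * (1 - pa) * pb * (1 - pb))


def greedy_tags(haplotypes, threshold=0.8):
    """Repeatedly pick the SNP that tags the most still-untagged SNPs."""
    n_sites = len(haplotypes[0])
    cols = [[h[j] for h in haplotypes] for j in range(n_sites)]
    covers = [
        {k for k in range(n_sites) if k == j or r2(cols[j], cols[k]) >= threshold}
        for j in range(n_sites)
    ]
    untagged, tags = set(range(n_sites)), []
    while untagged:
        best = max(range(n_sites), key=lambda j: len(covers[j] & untagged))
        tags.append(best)
        untagged -= covers[best]
    return tags


# Toy data: SNP 0 and SNP 1 are perfectly correlated, SNP 2 is independent of both.
haps = [[0, 0, 0], [0, 0, 1], [1, 1, 0], [1, 1, 1]]
tags = greedy_tags(haps)
print(tags)
```

On the toy data one of the two correlated SNPs is enough, so two tags cover all three sites. A greedy search is not guaranteed to find the minimal set.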

But notice! Alzheimer and ApoE: [Figure: association against distance from the APOE locus (Kbp), marking the responsible marker and 6 markers with low association.] Closeness to the disease marker does not guarantee significance!

Multi-marker approaches... [Figure: the haplotype panel shown twice, once for the single marker approach and once for the multi marker approach.]

Using the (local) genealogy of the locus ● Tree at disease site, with leaves labelled healthy (H) or diseased (D): ➔ “Perfect” setup: HHHHHHHH DDDDD ➔ Incomplete penetrance: HHHHHHHH DDDHD ➔ Other disease causes: HDHHHDHH DDDHD (Templeton et al 1987)

Using the (local) genealogy of the locus ● At the disease site: ➔ A significant clustering of diseased/healthy leaves, e.g. HDHHHDHH DDDHD (Templeton et al 1987)

Using the (local) genealogy of the locus ● Local genealogies ➔ Each site a different genealogy ➔ Nearby genealogies only slightly different
--T G A----G---X----C----C-----A--
--A G G----G---X----C----C-----A--
--A C A----G---X----T----C-----A--
--T C A----G---X----T----C-----A--
--T C A----T---X----T----A-----A--
--A C A----G---X----T----C-----A--
[Figure: a nearby tree and an imperfect local tree drawn over these haplotypes.]

Detour: Genealogies... [Figure labels: MRCA of the sampled sequences; a coalescent event for two sampled sequences; a recombination event.]

Ancestral Recombination Graph (Hudson 1990, Griffiths & Marjoram 1996) [Figure labels: sampled sequences; MRCA; recombinations; coalescence; non-ancestral material; mutations; 1234.] The ARG is a complete genealogy for the sampled sequences

“Local” genealogies For each “point” on the chromosome, the ARG determines a (local) tree:

“Local” genealogies ● Different topologies ● Different branch lengths ● Different inheritance

“Local” genealogies Type 1: No change Type 2: Change in branch lengths Type 3: Change in topology From Hein et al. 2005

“Local” genealogies Tree measure: M_AB = [∑_{i,j} I{i=j} bl(i) bl(j)] / (tbl(A) tbl(B)); S_AB = M_AB / M_AA [Figure: the tree measure plotted against recombination rate. From Hein et al]

Using the (local) genealogy of the locus [Figure: the six haplotypes around the disease site X, with local trees drawn over them.] Tree at disease site resembles neighbours

Using the (local) genealogy of the locus ● Near the disease site: ➔ A significant clustering of diseased/healthy leaves, e.g. HDHHHDHH DDDHD (Templeton et al 1987; Zöllner & Pritchard 2004)

Using the (local) genealogy of the locus ● Approach: ➔ Infer trees over regions ➔ Score the regions wrt their clustering (Templeton et al 1987; Zöllner & Pritchard 2004)

BLOck aSSOCiation (BLOSSOC) Mailund et al ● In the infinite sites model: ➔ Each mutation occurs only once ➔ Each mutation splits the sample in two ➔ A consistent tree can efficiently be inferred for a region without recombinations

BLOck aSSOCiation (BLOSSOC) Mailund et al Use the four-gamete test to find regions that can be explained by a tree
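The four-gamete test itself is short: two binary markers are compatible with a single tree (under the infinite sites model) unless all four allele combinations occur among the haplotypes. The sketch below applies it in a simple left-to-right scan that cuts the markers into maximal runs of pairwise-compatible sites. That scan is a simplification chosen for illustration and may differ from how Blossoc selects regions; the function names are hypothetical.

```python
def compatible(haplotypes, i, j):
    """Four-gamete test: sites i and j fit one tree if fewer than 4 gametes occur."""
    return len({(h[i], h[j]) for h in haplotypes}) < 4


def tree_regions(haplotypes):
    """Split sites into consecutive regions whose markers are pairwise compatible.

    Returns a list of (first_site, last_site) index pairs.
    """
    n_sites = len(haplotypes[0])
    regions, start = [], 0
    for j in range(1, n_sites):
        # Close the current region as soon as site j conflicts with any site in it.
        if any(not compatible(haplotypes, i, j) for i in range(start, j)):
            regions.append((start, j - 1))
            start = j
    regions.append((start, n_sites - 1))
    return regions


# Toy haplotypes over three binary markers.
haps_a = [[0, 0, 0], [0, 0, 1], [1, 1, 0], [1, 1, 1]]
haps_b = [[0, 0, 0], [0, 1, 0], [1, 0, 1], [1, 1, 1]]
print(tree_regions(haps_a))  # sites 0 and 1 share a tree; site 2 conflicts with both
print(tree_regions(haps_b))  # neighbouring sites conflict, so every site stands alone
```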

BLOck aSSOCiation (BLOSSOC) Mailund et al Build a tree for each such region

BLOck aSSOCiation (BLOSSOC) Mailund et al Score the tree, and assign the score to the region

Scoring trees... Red=cases Green=controls Are the case chromosomes significantly overrepresented in some clusters?
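One simple way to turn that question into a number is to test every clade of the tree with a 2x2 χ² statistic (inside or outside the clade, case or control) and take the maximum. This is an illustrative stand-in, not Blossoc's scoring function, which is not given on the slides; the clades are supplied as sets of leaf indices and all names are assumptions.

```python
def clade_score(clade, is_case):
    """Chi-square statistic for case/control status inside versus outside a clade."""
    n = len(is_case)
    a = sum(1 for leaf in clade if is_case[leaf])  # cases inside the clade
    b = len(clade) - a                             # controls inside
    c = sum(is_case) - a                           # cases outside
    d = n - len(clade) - c                         # controls outside
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def tree_score(clades, is_case):
    """Score a tree by its most strongly case/control-separating clade."""
    return max(clade_score(clade, is_case) for clade in clades)


# Toy tree over six chromosomes: leaves 0-2 are cases, leaves 3-5 are controls.
is_case = [True, True, True, False, False, False]
clades = [{0, 1, 2}, {3, 4}, {0, 1}]
score = tree_score(clades, is_case)
print(score)
```

The statistic is two-sided, so a clade enriched for controls scores as high as one enriched for cases. Because the maximum is taken over many clades, significance would have to be assessed by permuting the case/control labels rather than by a χ² table.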

Cystic Fibrosis example

Simulated Example (CoaSim)

Augmented HapMap data

Implementation... Homepage: Command line and graphical user interface...

Statistical model based approaches... Statistical framework, molecular biology, prior knowledge, genetics: some model explaining the sequences and status [Figure: the case/control haplotype panel.]

Statistical model based approaches... A model gives a probability distribution on the data: P( D | θ ), where D is the data, e.g. sequences and disease status, and θ the parameters, e.g. penetrance, disease locus, and genealogy

Statistical model based approaches... A model gives a probability distribution on the data, P( D | θ ), which also gives us likelihood approaches: lhd(θ) = P( D | θ ), MLE: argmax_θ lhd(θ); and Bayesian approaches: P( θ | D ) ∝ P( D | θ ) P(θ) = lhd(θ) P(θ)

MCMC (Metropolis) approach 1. Compute the likelihood in the current point, lhd(θ) = L 2. Suggest a new point, θ' 3. Compute the likelihood in this point, lhd(θ') = L’ 4. If L ≤ L’, go to point θ' 5. If L > L’, go to point θ' with probability L’/L, otherwise stay at θ. Marginal likelihood of one parameter x: lhd(x) = ∫ lhd(x, θ) dθ, integrating over all parameters except x

MCMC (Metropolis) approach [Figure: sampled points of the chain.] Projection on one axis is equivalent to integration over the remaining parameters. The resulting samples approximate the likelihood lhd
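The five steps translate almost directly into code. The sketch below uses a symmetric Gaussian proposal and a made-up two-parameter likelihood centred at (2, 0); both are assumptions for illustration, as are the function names. Keeping only the first coordinate of each sample is the "projection on one axis" described above.

```python
import random
from math import exp


def metropolis(lhd, start, step, n_samples, seed=0):
    """Metropolis sampler with a symmetric Gaussian random-walk proposal."""
    rng = random.Random(seed)
    point, current = start, lhd(start)                 # step 1: L at the current point
    samples = []
    for _ in range(n_samples):
        proposal = tuple(x + rng.gauss(0, step) for x in point)  # step 2: suggest θ'
        candidate = lhd(proposal)                                 # step 3: L'
        # Steps 4 and 5: always move if L <= L', otherwise move with probability L'/L.
        if candidate >= current or rng.random() < candidate / current:
            point, current = proposal, candidate
        samples.append(point)
    return samples


def toy_lhd(theta):
    """Made-up likelihood surface with its peak at (2, 0)."""
    x, y = theta
    return exp(-0.5 * ((x - 2.0) ** 2 + y ** 2))


samples = metropolis(toy_lhd, start=(0.0, 0.0), step=1.0, n_samples=5000)
xs = [s[0] for s in samples[500:]]   # drop burn-in, project onto the first axis
print(sum(xs) / len(xs))             # should land near 2
```

The proposal must be symmetric for this acceptance rule to be valid; an asymmetric proposal needs the Metropolis-Hastings correction.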

The HapCluster model Waldron et al [Figure: haplotypes around the trait locus X, divided into unrelated “wildtypes” and (locally) related “mutants”.]

The HapCluster model Waldron et al ● “Mutants” defined by local sequence similarity to “ancestral” sequence ● Implicitly assuming star-genealogy

The HapCluster model Waldron et al ● Given “ancestral” sequence and a distance measure: ➔ Defines cluster around the ancestral sequence ➔ Sequences above a given similarity threshold considered “mutants” ➔ Sequences below considered “wild types”
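The clustering step can be sketched once a similarity measure is fixed. The slides do not specify the measure, so the fraction of matching markers used below is purely an assumption, as are the example haplotypes, the ancestral sequence and the threshold.

```python
def similarity(haplotype: str, ancestral: str) -> float:
    """Fraction of markers at which the haplotype matches the ancestral sequence."""
    matches = sum(1 for h, a in zip(haplotype, ancestral) if h == a)
    return matches / len(ancestral)


def classify(haplotypes, ancestral, threshold):
    """Label each haplotype 'mutant' (in the cluster) or 'wild type' (outside it)."""
    return [
        "mutant" if similarity(h, ancestral) >= threshold else "wild type"
        for h in haplotypes
    ]


# Hypothetical five-marker haplotypes and ancestral sequence.
ancestral = "AGCGA"
haplotypes = ["ATCGA", "AGCGA", "TGCCG"]
labels = classify(haplotypes, ancestral, threshold=0.75)
print(labels)
```

In the MCMC described later in the deck, the ancestral haplotype and the cluster size are themselves parameters that the chain moves over, so a fixed `ancestral` and `threshold` here correspond to one point of that chain.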

The HapCluster model Waldron et al ● Each individual has one of the genotypes: ➔ “mutant” & “mutant” (MM) ➔ “mutant” & “wild type” (MW) ➔ “wild type” & “wild type” (WW) ● Each genotype has a different risk (one each for MM, MW, WW) of being affected ● Likelihood:

The HapCluster model Waldron et al ● Risks considered nuisance parameters and integrated out

HapCluster MCMC approach (Waldron et al. 2006) ● Point: trait-locus, ancestral haplotype, other (nuisance) parameters ● Change functions: move trait-locus, change cluster size, change ancestral haplotype... ● Likelihood function: product of Beta functions
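A product of Beta functions is what integrating out the risks produces. If a genotype class has a affected and u unaffected individuals and its risk p has a uniform prior, then ∫ p^a (1-p)^u dp = B(a+1, u+1). The sketch below multiplies one such term per genotype class, on the log scale. The uniform prior and the function names are assumptions; the priors actually used in Waldron et al. 2006 may differ.

```python
from math import exp, lgamma


def log_beta(x: float, y: float) -> float:
    """log B(x, y) computed from log-gamma for numerical stability."""
    return lgamma(x) + lgamma(y) - lgamma(x + y)


def log_marginal_likelihood(counts) -> float:
    """Log of the product of Beta functions over genotype classes.

    counts maps a genotype class ("MM", "MW", "WW") to (affected, unaffected).
    Each class risk is integrated out against a uniform prior (an assumption).
    """
    return sum(log_beta(a + 1, u + 1) for a, u in counts.values())


# Hypothetical counts of affected and unaffected individuals per genotype class.
counts = {"MM": (8, 2), "MW": (30, 25), "WW": (62, 73)}
print(log_marginal_likelihood(counts))
```

Inside the sampler, a proposed change to the trait locus, cluster size or ancestral haplotype reshuffles individuals between the three classes, and this quantity is recomputed to get the L and L’ of the Metropolis step.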

Example: Simulated dataset

Implementation... Homepage: Command line version only...

Summary ● Introduction ➔ Goals and setup ➔ Genetic variation in humans ● Marker/disease association ➔ Indirect association and linkage disequilibrium ➔ Background population genetics concepts ➔ “Global” and “local” genealogies ● Mapping methods ➔ Local genealogies (Blossoc) ➔ Clustering (HapCluster)