Centro Nacional de Genotipado (CeGen). Bioinformatic analysis of sequences and of gene and genome expression. Human SNPs: theory and practice. Arcadi Navarro. Madrid, 6 April 2006
What is CeGen? CeGen is a technology platform, an initiative of GENOMA ESPAÑA, that aims to provide the expertise and infrastructure needed to carry out large-scale, low-cost SNP (Single Nucleotide Polymorphism) genotyping projects. The initiative is aimed at research groups in universities, hospitals, research centres and industry. CeGen seeks to help bring about a qualitative and quantitative leap in research through high value-added services provided from Spain.
SNPs (Single Nucleotide Polymorphisms): Frequent. Well distributed. Stable. Functional? Suitable for large-scale processing.
What is a SNP (Single Nucleotide Polymorphism)? A SNP is a position in a genome at which two or more different bases occur in the population, each with a frequency >1%. Example (the eleventh base is G in one sequence and T in the other): GATTTAGATCGCGATAGAG / GATTTAGATCTCGATAGAG - SNPs are the most common type of variation (genetic marker). - Most SNPs have only two variants (alleles), for example G/T or A/C.
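A minimal sketch of how the two example sequences above can be compared to locate the variable site. The function name `find_variant_sites` is hypothetical, and a site that differs between two sequences is only a candidate SNP: the definition also requires each variant to reach a frequency above 1% in the population, which this sketch does not check.

```python
def find_variant_sites(seq_a: str, seq_b: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, base in seq_a, base in seq_b) for every mismatch."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (i + 1, a, b)
        for i, (a, b) in enumerate(zip(seq_a, seq_b))
        if a != b
    ]


# The two sequences shown on the slide.
sites = find_variant_sites("GATTTAGATCGCGATAGAG", "GATTTAGATCTCGATAGAG")
print(sites)  # [(11, 'G', 'T')]
```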
Scientific interest. There is great demand for analysing SNPs in large numbers, and the available technologies are costly and highly varied. - CeGen offers, from Spain, large-scale, low-cost genotyping services tailored to each need. - CeGen offers scientific support and access to technologies for projects that would otherwise require much more time and resources. The easier genotyping becomes, the more projects, and the more ambitious ones, can be undertaken. E.g., in association studies that require genotyping many individuals (cases, controls): - Possibility of a whole genome scan - BioBank projects
Strategic interest. CeGen can help with applications for research projects that involve genotyping, at several levels: - Proposing SNP selection strategies - Offering the large-scale technologies that make the project viable - Calculating budgets that can be included in the application
Back to SNPs: Frequent. Well distributed. Stable. Functional? Suitable for large-scale processing.
Terminology. An allele is one of a number of alternative forms of the same gene occupying a given locus. If we are considering SNPs, an allele is one of two alternative forms. A locus is the physical location of an allele on the chromosome. A haplotype is a set of alleles that tend to be inherited together (not easily separable by recombination). Example: Consider 2 loci, each with two possible alleles, the first locus being either A or a, the second locus being B or b. Then there are 4 possible haplotypes: AB, Ab, aB, ab.
What are SNPs used for? (I) Genotyping in humans - Searching for disease susceptibility genes - Diagnosis / prognosis - Drug metabolism - Adverse drug reactions (personalised medicine) - Response to environmental factors - Forensic genetics - Genome structure and dynamics - Genome evolution
What are SNPs used for? (II) Genotyping in other species - Microbial identification - Analysis of microbial communities - Genotyping in yeast - Uses similar to those in humans, in mouse and other model organisms - Widespread use in domestic species (QTL), both animal and plant - Identification of plant varieties. Their use will expand as new SNP maps become available.
SNPs in the Human Genome. All humans share 99.9% of the same genetic sequence - SNPs occur about every 1000 base pairs. 90% of human genome variation comes from SNPs. The human genome contains 10 million validated SNPs and 21 million submitted. - ~340,000 SNPs are found in genes. SNPs are not evenly spaced along the sequence - SNP-rich regions - SNP-poor regions
What is a Haplotype? A haplotype is a sequence of alleles stretching along an extended segment of DNA, a sort of super allele! Haplotypes are usually inherited as a single unit from parents. [Figure: an individual with genotype Aa Bb may carry either the haplotype pair AB / ab or the haplotype pair Ab / aB.]
Alleles of Adjacent SNPs on a Chromosome form Haplotypes a. Short stretch of DNA for 4 different people – 3 SNPs are present b. Haplotypes made up of a combination of different alleles at 20 nearby SNPs c. Genotyping just 3 “tag” SNPs can distinguish all 4 haplotypes
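The idea in point c can be sketched as a search for the smallest set of SNP positions whose alleles still tell all the observed haplotypes apart. The haplotypes below are invented for illustration (they are not the ones in the slide's figure), the function name `smallest_tag_set` is hypothetical, and the brute-force search is only practical for a handful of SNPs; it is not meant to represent how tag SNPs are chosen in real projects.

```python
from itertools import combinations


def smallest_tag_set(haplotypes: list[str]) -> tuple[int, ...]:
    """Return the first smallest set of 0-based SNP indices that
    distinguishes every distinct haplotype from every other one."""
    n_sites = len(haplotypes[0])
    n_distinct = len(set(haplotypes))
    for size in range(1, n_sites + 1):
        for sites in combinations(range(n_sites), size):
            # Signature of each haplotype when only `sites` are genotyped.
            signatures = {tuple(h[i] for i in sites) for h in haplotypes}
            if len(signatures) == n_distinct:
                return sites
    return tuple(range(n_sites))


# Hypothetical haplotypes: 4 chromosomes typed at 5 SNPs.
example = ["ACGTA", "ACGCA", "GTGTA", "GTACC"]
print(smallest_tag_set(example))  # (0, 3)
```

Here genotyping only the first and fourth SNPs is enough to recover which of the four haplotypes a chromosome carries.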
What is Linkage Disequilibrium? Linkage Disequilibrium (LD) is the nonindependence (nonrandom association) of alleles at different sites (different SNPs for the rest of the session). Example: Suppose that allele A at locus 1 and allele B at locus 2 are at frequencies $p_A$ and $p_B$, respectively, in the population. If the two loci are independent, then we would expect to see the AB haplotype at frequency $p_A p_B$. If the population frequency of the AB haplotype is either higher or lower than this, implying that particular alleles tend to be observed together, then the two loci are said to be in LD.
Linkage "Equilibrium". E.g., two adjacent SNPs (SNP 1 with alleles A/a, SNP 2 with alleles B/b) are genotyped in a population. There are 4 possible haplotypes: AB, Ab, aB, ab.

                    SNP 1
                    A            a            total
    SNP 2    B      $f_{AB}$     $f_{aB}$     $f_B$
             b      $f_{Ab}$     $f_{ab}$     $f_b$
         total      $f_A$        $f_a$

Under linkage equilibrium we have what we expect: $f_{AB} = f_A f_B$, $f_{aB} = f_a f_B$, $f_{Ab} = f_A f_b$, $f_{ab} = f_a f_b$.
Linkage Disequilibrium (LD). E.g., two adjacent SNPs (SNP 1 with alleles A/a, SNP 2 with alleles B/b) are genotyped in a population. There are 4 possible haplotypes: AB, Ab, aB, ab.

                    SNP 1
                    A            a            total
    SNP 2    B      $f_{AB}$     $f_{aB}$     $f_B$
             b      $f_{Ab}$     $f_{ab}$     $f_b$
         total      $f_A$        $f_a$

Under linkage disequilibrium we have different results: $f_{AB} = f_A f_B + D$, $f_{aB} = f_a f_B - D$, $f_{Ab} = f_A f_b - D$, $f_{ab} = f_a f_b + D$, where D is the LD coefficient: $D = f_{AB} f_{ab} - f_{aB} f_{Ab}$ or, equivalently, $D = f_{AB} - f_A f_B$.
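Linkage equilibrium vs. Linkage Disequilibrium (LD), example 1.

Linkage equilibrium:

                    SNP 1
                    A        a        total
    SNP 2    B      0.25     0.25     0.5
             b      0.25     0.25     0.5
         total      0.5      0.5

Linkage disequilibrium:

                    SNP 1
                    A        a        total
    SNP 2    B      0.5      0        0.5
             b      0        0.5      0.5
         total      0.5      0.5

where D is 0.25: $D = f_{AB} f_{ab} - f_{aB} f_{Ab} = 0.5 \times 0.5 - 0 \times 0 = 0.25$, or $D = f_{AB} - f_A f_B = 0.5 - 0.5 \times 0.5 = 0.25$.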
Linkage equilibrium vs. Linkage Disequilibrium (LD), example 2.

Linkage equilibrium:

                    SNP 1
                    A        a        total
    SNP 2    B      0.12     0.48     0.6
             b      0.08     0.32     0.4
         total      0.2      0.8

Linkage disequilibrium:

                    SNP 1
                    A        a        total
    SNP 2    B      0.02     0.58     0.6
             b      0.18     0.22     0.4
         total      0.2      0.8

where D is -0.10: $D = f_{AB} f_{ab} - f_{aB} f_{Ab} = 0.02 \times 0.22 - 0.58 \times 0.18 = -0.10$, or $D = f_{AB} - f_A f_B = 0.02 - 0.2 \times 0.6 = -0.10$.
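The two ways of computing D can be checked against the tables above with a short sketch. The function name `ld_coefficient` and its argument order are choices made here for illustration; the input is assumed to be the four haplotype frequencies, which must sum to 1.

```python
def ld_coefficient(f_AB: float, f_Ab: float, f_aB: float, f_ab: float) -> tuple[float, float]:
    """Return D computed two ways: as the cross-product difference and as
    the observed AB frequency minus its expectation under independence."""
    total = f_AB + f_Ab + f_aB + f_ab
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    f_A = f_AB + f_Ab  # marginal frequency of allele A at SNP 1
    f_B = f_AB + f_aB  # marginal frequency of allele B at SNP 2
    d_cross = f_AB * f_ab - f_aB * f_Ab
    d_marginal = f_AB - f_A * f_B
    return d_cross, d_marginal


# Disequilibrium tables from the two examples (order: AB, Ab, aB, ab).
print(ld_coefficient(0.5, 0.0, 0.0, 0.5))      # both values are 0.25
print(ld_coefficient(0.02, 0.18, 0.58, 0.22))  # both values are about -0.10
# Equilibrium table from the second example: D is (numerically) zero.
print(ld_coefficient(0.12, 0.08, 0.48, 0.32))
```

Both formulas give the same number whenever the four frequencies sum to 1, which is why the slides use them interchangeably.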
Assessing LD 1.Measuring it with some parameter (we have just seen D) 2.Testing statistically whether it exists: random association or LD?
The Measure of LD. The D coefficient depends on the marginal allele frequencies in the contingency table. This limits D as a measure of association: the range of values it can take changes with the allele frequencies, so it cannot be compared across different SNPs or populations. However, D can be normalised to D', making it comparable across populations and SNPs.
The D' measure of LD. $D' = D / D_{\max}$, where $D_{\max}$ is the largest absolute value D can take given the allele frequencies: $D_{\max} = \min(f_A f_b,\ f_a f_B)$ when $D > 0$; $D_{\max} = \min(f_A f_B,\ f_a f_b)$ when $D < 0$.
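A sketch of the normalisation, following the two cases for $D_{\max}$ given above. The function name `d_prime` is hypothetical, returning 0 when a SNP is monomorphic is a convention chosen here, and the example inputs are either taken from the earlier disequilibrium table or constructed for illustration.

```python
def d_prime(f_AB: float, f_Ab: float, f_aB: float, f_ab: float) -> float:
    """Return D' = D / D_max for two biallelic SNPs."""
    f_A = f_AB + f_Ab
    f_a = 1.0 - f_A
    f_B = f_AB + f_aB
    f_b = 1.0 - f_B
    d = f_AB - f_A * f_B
    if d > 0:
        d_max = min(f_A * f_b, f_a * f_B)
    else:
        d_max = min(f_A * f_B, f_a * f_b)
    # If a SNP is monomorphic, D_max is 0 and D' is undefined; report 0 here.
    return d / d_max if d_max > 0 else 0.0


print(d_prime(0.5, 0.0, 0.0, 0.5))        # 1.0: only AB and ab are present
print(d_prime(0.02, 0.18, 0.58, 0.22))    # about -0.83 (D = -0.10, D_max = 0.12)
print(d_prime(0.5, 0.25, 0.25, 0.0))      # -1.0 although three haplotypes are present
```

The last call uses constructed frequencies in which one haplotype (ab) is absent: |D'| reaches 1 even though three haplotypes are observed.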
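Example D' Calculation. Haplotype frequencies: $f_{AB} = 0.765$, $f_{aB} = 0.235$, $f_{Ab} = 0.167$, $f_{ab} = 0.833$. Allele frequencies: $f_A = 0.52$, $f_a = 0.48$, $f_B = 0.59$, $f_b = 0.41$. The slide gives $D = (0.765 \times 0.833) - (0.167 \times 0.235) = 0.025$. (As listed, the four haplotype frequencies sum to 2 rather than 1, so this product does not actually evaluate to 0.025; the remaining steps take $D = 0.025$ as given.) Since D is positive (> 0), $D_{\max} = \min(0.52 \times 0.41,\ 0.48 \times 0.59) = 0.2132$, and $D' = D / D_{\max} = 0.025 / 0.2132$. Thus, $D' = 0.117$.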
Interpretation of the D' coefficient. D' is constrained between -1 and +1, where: D' = 1 (perfect positive LD between SNP alleles); D' = 0 (linkage equilibrium, or no association between SNP alleles); D' = -1 (perfect negative LD between SNP alleles); D' = 0.87 (strong positive LD between SNP alleles); D' = 0.12 (weak positive LD between SNP alleles). Significance (P-value) for D' is determined from the chi-squared distribution.
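The slide does not spell out the test, so the following is a sketch under an assumption: phased haplotype counts are treated as a 2x2 contingency table and tested with the usual Pearson chi-squared statistic on 1 degree of freedom, which for this table equals $n D^2 / (f_A f_a f_B f_b)$ with $n$ the number of chromosomes sampled. The function name and the sample size of 200 chromosomes are hypothetical.

```python
import math


def ld_chi_squared(f_AB: float, f_Ab: float, f_aB: float, f_ab: float,
                   n_chromosomes: int) -> tuple[float, float]:
    """Return (chi-squared statistic, P-value) for association between two SNPs."""
    f_A = f_AB + f_Ab
    f_B = f_AB + f_aB
    d = f_AB - f_A * f_B
    denom = f_A * (1.0 - f_A) * f_B * (1.0 - f_B)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic")
    chi2 = n_chromosomes * d * d / denom
    # Upper tail of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Disequilibrium table from the earlier example, assuming 200 sampled chromosomes.
chi2, p = ld_chi_squared(0.02, 0.18, 0.58, 0.22, n_chromosomes=200)
print(round(chi2, 2), p)  # chi-squared is about 52.08; the P-value is far below 0.001
```

With unphased genotype data the haplotype frequencies are themselves estimates, so a permutation or likelihood-ratio test may be more appropriate than this simple version.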
Interpretation of the D' coefficient. D' = 0 means no LD. |D'| = 1 means complete LD. Careful: |D'| can be 1 when 3 haplotypes are present, that is, when exactly one cell of the haplotype table is empty:

                    A        a
             B      AB       aB
             b      Ab       0
LD Plots of Adjacent SNPs LD varies significantly across genomic regions
The $r^2$ measure of LD. The disequilibrium coefficient $r^2$ (sometimes also denoted $\Delta^2$) is the squared statistical correlation between 2 sites. Consider two biallelic loci on the same chromosome, with alleles A and a at the first locus and with alleles B and b at the second locus. The allele frequencies will be written as $p_A$, $p_a$, $p_B$, and $p_b$, and the four haplotype frequencies will be written as $p_{AB}$, $p_{Ab}$, $p_{aB}$, and $p_{ab}$. Then: $r^2 = \dfrac{(p_{AB}\,p_{ab} - p_{Ab}\,p_{aB})^2}{p_A\,p_a\,p_B\,p_b} = \dfrac{D^2}{p_A\,p_a\,p_B\,p_b}$. $r^2$ is related to D, of course.
$r^2$ vs. D'. Both measures are 1 in case of complete disequilibrium and 0 if there is no LD. But $r^2 = 1$ only when just 2 of the 4 possible haplotypes are present, while D' is less strict: |D'| = 1 can reflect 2 or 3 haplotypes being present.
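The difference can be made concrete with a sketch that computes $r^2$ using the standard definition $r^2 = D^2 / (p_A p_a p_B p_b)$. The function name `r_squared` is hypothetical and the frequencies are constructed for illustration.

```python
def r_squared(p_AB: float, p_Ab: float, p_aB: float, p_ab: float) -> float:
    """Return r^2 = D^2 / (p_A * p_a * p_B * p_b) for two biallelic loci."""
    p_A = p_AB + p_Ab
    p_B = p_AB + p_aB
    denom = p_A * (1.0 - p_A) * p_B * (1.0 - p_B)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    d = p_AB * p_ab - p_Ab * p_aB
    return d * d / denom


# Two haplotypes present (AB and ab): r^2 is 1.
print(r_squared(0.5, 0.0, 0.0, 0.5))    # 1.0
# Three haplotypes present (ab absent): |D'| would be 1, but r^2 is only about 0.11.
print(r_squared(0.5, 0.25, 0.25, 0.0))
```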
The representation of LD. [Figure: scatter plot of linkage disequilibrium (D', from 0 to 1) against distance between SNPs in base pairs (5 kb to 160 kb).] This is just Excel, say.
Testing LD with a likelihood ratio statistic (Zhao et al., Ann. Hum. Genet., 63:167-179, 1999). [Formula and its symbols not preserved in the transcript.] Its terms are: the degrees of freedom; n, the sample size; t, the observed likelihood ratio (LR); the average LR in a permuted distribution; and the standard deviation of the permuted LR distribution.
FNF: fraction not found. Based on the fact that LD reduces the number of haplotypes (Slatkin, Genetics 154:1367-1378, 2000; Mateu et al., Am. J. Hum. Genet. 68:103-117, 2001). $K_e$: expected number of haplotypes under linkage equilibrium, given the allele frequencies and the sample size. $K_o$: observed number of haplotypes. $K_{\min}$: minimum possible number of haplotypes.
OK, fine… But, why is there any LD at all in the genome?
The origins of Linkage Disequilibrium. [Figure: variations in chromosomes within a population, traced along a time axis from a common ancestor to the present. Labels: emergence of variations (assume no recombination); disease mutation.]
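So LD is the basis of, for example, association studies (you'll see more about this later...). And we can go even deeper: LD decays with recombination, $D_{t+1} = (1 - c)\,D_t$, where $c$ is the recombination fraction between the two loci and $t$ counts generations.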
LD is a function of distance. Distances: 1) Physical distance between loci is measured in base pairs. 2) Genetic distance is based on the probability of recombination; its unit is the Morgan. - A distance of 1 centiMorgan (cM) between two loci means that they have a 1% chance of being separated by recombination in one generation. - In humans, a genetic distance of 1 cM is roughly equal to a physical distance of 1 million base pairs (1 Mbp).
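Combining the decay of D by a factor of one minus the recombination fraction per generation with the 1 cM per Mbp rule of thumb gives a rough feel for how fast LD disappears. This is a sketch under simplifying assumptions: the recombination fraction is written `c`, it is taken to be proportional to physical distance (reasonable only for short distances), and drift, selection and migration are ignored. Function names and the starting value of D are illustrative.

```python
def recombination_fraction(distance_bp: float, cm_per_mb: float = 1.0) -> float:
    """Approximate recombination fraction per generation for a physical
    distance, using 1 cM = 1% recombination and a uniform cM/Mb rate."""
    return distance_bp / 1_000_000 * cm_per_mb / 100.0


def ld_after(d0: float, c: float, generations: int) -> float:
    """Apply D_{t+1} = (1 - c) * D_t for the given number of generations."""
    d = d0
    for _ in range(generations):
        d *= 1.0 - c
    return d


c = recombination_fraction(100_000)   # two SNPs 100 kb apart: c = 0.001
print(c)
print(ld_after(0.25, c, 1000))        # about 0.092 after 1,000 generations
print(ld_after(0.25, c, 2000))        # about 0.034 after 2,000 generations
```

The loop is equivalent to the closed form $D_t = (1 - c)^t D_0$; it is written out step by step only to mirror the recurrence.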
LD decays with recombination. [Figure: chromosomes carrying a disease-causing mutation shown at three time points: 2,000 generations ago, 1,000 generations ago, and the present.]
(Think Finland: 1000 founders 2000 years ago; consistent expansion.) Few (maybe no) recurrences of the disease-causing mutation. (Think Earth: 10,000 "founders" ($N_e$); 100,000 years ago.) Assume old mutations cause common diseases.
Wait a sec, these are the haplotypes, right? [Figure, as before: variations in chromosomes within a population, traced along a time axis from a common ancestor to the present. Labels: emergence of variations (assume no recombination); disease mutation.]
And remember one can select tag-SNPs… a. Short stretch of DNA for 4 different people – 3 SNPs are present b. Haplotypes made up of a combination of different alleles at 20 nearby SNPs c. Genotyping just 3 “tag” SNPs can distinguish all 4 haplotypes
Cool!!! Nowadays, we can massively genotype individuals. We could potentially “cover” the whole genome using the property of LD and a few tag-SNPs…and..and…and… But… How many SNPs to tag all the genome? And…can we easily ascertain individual haplotypes?
Haplotyping: Phase Problem. Observed genotype in a diploid individual: SNP1 G/T, SNP2 A/C. Possible haplotype pairs: GA and TC, or GC and TA. With n SNPs there are $2^n$ possible haplotypes.
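The ambiguity can be enumerated directly. In this sketch a genotype is represented as a list of unordered allele pairs, one per SNP (a representation chosen here for illustration, as is the function name), and every haplotype pair compatible with it is listed.

```python
from itertools import product


def compatible_haplotype_pairs(genotype: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """List every unordered pair of haplotypes consistent with an unphased genotype."""
    pairs = set()
    # At a heterozygous site either allele can sit on the first chromosome;
    # at a homozygous site there is only one arrangement.
    options = [(0, 1) if a != b else (0,) for a, b in genotype]
    for choice in product(*options):
        hap1 = "".join(site[c] for site, c in zip(genotype, choice))
        hap2 = "".join(site[1 - c] for site, c in zip(genotype, choice))
        pairs.add(tuple(sorted((hap1, hap2))))
    return sorted(pairs)


# The slide's example: SNP1 is G/T and SNP2 is A/C.
print(compatible_haplotype_pairs([("G", "T"), ("A", "C")]))
# [('GA', 'TC'), ('GC', 'TA')]
```

An individual heterozygous at k SNPs has 2^(k-1) compatible haplotype pairs, so the number of possibilities grows quickly with the number of heterozygous sites.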
The Problem It’s not yet easy to measure an individual’s (only two) haplotypes Molecular haplotyping (nucleotide sequencing) is the gold standard A more efficient strategy: –Focus on regions, such as certain genes –Estimate haplotypes from SNP data (genotypes) –Use LD map, and reduce the number of loci to represent the haplotype –Use haplotype map (DB) = key SNPs + haplotype blocks with strong LD
Molecular Haplotyping Hetero-duplex analysis, mismatch detection, allele-specific PCR: Have potential to get high-throughput Only practical for short haplotypes (2-5 kb vs. 50-100kb) Costly Rolling Circle amplification method, etc: Can handle larger size Difficult to automate
In-silico Haplotyping: Two Tasks I.Reconstruction of the haplotypes of the sampled individuals. II. Estimation of haplotypes frequencies in a population.
In-silico Haplotyping: Approaches 1) Clark’s algorithm 2) E-M algorithm (expectation-maximization algorithm) 3) Bayesian algorithm Message: many different approaches
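To give a flavour of the second approach, here is a minimal sketch of expectation-maximization for the simplest case, two biallelic SNPs, where the only ambiguous genotype is the double heterozygote. The encoding of genotypes as (copies of A, copies of B), the function name, the starting values, the fixed iteration count and the example counts are all assumptions made for illustration; real software handles many SNPs, missing data and convergence checks.

```python
# Haplotypes carried by each unambiguous two-SNP genotype,
# keyed by (copies of allele A, copies of allele B).
RESOLVED = {
    (2, 2): ("AB", "AB"), (2, 1): ("AB", "Ab"), (2, 0): ("Ab", "Ab"),
    (1, 2): ("AB", "aB"), (1, 0): ("Ab", "ab"),
    (0, 2): ("aB", "aB"), (0, 1): ("aB", "ab"), (0, 0): ("ab", "ab"),
}


def em_haplotype_freqs(genotype_counts: dict[tuple[int, int], int],
                       n_iter: int = 100) -> dict[str, float]:
    """Estimate AB, Ab, aB, ab frequencies from unphased genotype counts."""
    fixed = {"AB": 0.0, "Ab": 0.0, "aB": 0.0, "ab": 0.0}
    for genotype, n in genotype_counts.items():
        if genotype != (1, 1):
            for hap in RESOLVED[genotype]:
                fixed[hap] += n
    n_double_het = genotype_counts.get((1, 1), 0)
    n_chromosomes = 2 * sum(genotype_counts.values())

    freqs = dict.fromkeys(fixed, 0.25)  # start from equal frequencies
    for _ in range(n_iter):
        # E-step: probability that a double heterozygote is AB/ab rather than Ab/aB.
        cis = freqs["AB"] * freqs["ab"]
        trans = freqs["Ab"] * freqs["aB"]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: re-count haplotypes using the expected phase assignment.
        counts = dict(fixed)
        counts["AB"] += n_double_het * w
        counts["ab"] += n_double_het * w
        counts["Ab"] += n_double_het * (1.0 - w)
        counts["aB"] += n_double_het * (1.0 - w)
        freqs = {hap: c / n_chromosomes for hap, c in counts.items()}
    return freqs


# Hypothetical sample: 10 AABB, 10 aabb and 20 AaBb individuals.
print(em_haplotype_freqs({(2, 2): 10, (0, 0): 10, (1, 1): 20}))
```

In this example the unambiguous individuals only ever show AB and ab, so the algorithm ends up assigning essentially all double heterozygotes to the AB/ab phase and the estimates approach 0.5 for AB and ab.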
How Far Does Association (LD) Extend Between Neighboring Common Sites? [Figure: distance scale from 0 kb through 5, 10, 20, 40 and 80 kb to 160 kb, with a range of uncertainty marked.] Theoretical (given 1 cM/Mb): 3-8 kb, but…
Strategy for Assessing Extent of LD: 19 regions; 44 Caucasian samples from Utah; a great deal of DNA sequencing per sample. [Figure: distance from the core single nucleotide polymorphism (SNP), on a scale of 0, 5, 10, 20, 40, 80 and 160 kb.]
MYSTERY: What explains the long-range LD? Maybe an important event in population history? LD and population genomics
Positive Control: 48 Swedes Identical pattern to Utah
96 Nigerians (Yoruba): much less LD. Associations in Africans are a SUBSET of those in Caucasians. LD MUST be influenced by population history.
Confirmation of less LD in Africans from Direct DNA Sequencing
More evidence from Genotyping ~5,000 SNPs (Gabriel et al. 2002)
Explanation: Bottleneck or 'Founder Effect' in the History of North Europeans. [Figure: an ancestral population splitting into Yoruba ancestors and North Europeans; likely <10 founding chromosomes ~100,000 years ago.] What was this event? (1) Out of Africa? (2) Founding of Europe?
Given the demographic properties of LD, which populations are best suited for association-based mapping studies? - LD reflects the ages of haplotypes in populations. - A population founded more recently is useful for detecting long-range associations between disease-causing mutations and marker SNPs. - Older populations are useful for fine-scale mapping. - But things are always more complex…
LD varies substantially across the genome! Maybe clearer this way:
MYSTERY: What explains the huge genomic variance in LD distribution? Maybe a lot of intra-genomic diversity. Maybe haplotype blocks? LD and population genomics
...it is not simple even within a population: a patterned structure of recombination in the genome can create blocks of LD
Haplotype Blocks. The human genome may be divided into regions of high LD called haplotype blocks. These are separated by smaller regions of low LD, usually attributed to recombination hotspots. A haplotype block consists of a few common haplotypes that account for a large DNA segment.
Haplotype Blocks. [Figure: SNPs (rows) by chromosomes (columns), with one haplotype block expanded.] - Each row represents a SNP - Blue dot = major allele - Yellow = minor allele - Each column represents a single chromosome - The 147 SNPs are divided into 18 blocks defined by black lines. - The expanded box on the right is a SNP block of 26 SNPs over 19 kb of genomic DNA. The 4 most common of 7 different haplotypes include 80% of the chromosomes, and can be distinguished with 2 SNPs.
These would be blocks: [Figure: map marking regions of high recombination and regions of high LD.]
... Likely to be caused by recombination hotspots (but things are not that easy)
So what we need is a haplotype map. The Haplotype Map, HapMap, will be a map of these haplotype blocks and the SNPs that identify the haplotypes. The HapMap will be a key resource for finding genes that contribute to disease risk and drug response.
What is the HapMap? The HapMap is a catalogue of common genetic variants (SNPs) that occur in humans. What information does the HapMap provide? - The characteristics of the SNPs (sequence variation, allele frequencies) - Where they occur in the genome (relative positions) - How they are distributed in human populations (LD and haplotype blocks)
Aims of the HapMap Project. To develop a map of the human genome that describes the common patterns of DNA sequence variation (haplotypes), for use in establishing connections between genetic variants and disease. Populations sampled (n = 270 people) - African - Asian - European. 614,030 SNPs genotyped (55 million genotypes)
The Construction of the HapMap Three main steps; 1. SNPs are identified in different individuals from different ethnic groups 2. Adjacent SNPs that are inherited together are compiled into haplotypes 3. SNPs that uniquely represent haplotypes ie. tagSNPs are identified for use in genetic association studies of disease
You’ll see a lot about this later. Let me just tell you a couple of things: LD and disease
Gene mapping by linkage in a dominant Mendelian disease: only recombination events in the families carry information to narrow down the location of the gene
In LD mapping, all the recombination events in the history of the disease are used to find the gene region
LD and complex diseases: LD between a marker and the (unknown) genetic variant contributing to the disease underlies the association approach (i.e., comparing allele frequencies between cases and controls)
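The comparison of allele frequencies between cases and controls can be sketched as a chi-squared test on a 2x2 table of allele counts. This is one common way to test for association, not necessarily the one used in the course; the function name and the counts are hypothetical.

```python
import math


def allele_association_test(case_counts: tuple[int, int],
                            control_counts: tuple[int, int]) -> tuple[float, float]:
    """Pearson chi-squared test (1 degree of freedom) on allele counts.

    Each argument is (count of allele 1, count of allele 2) at the marker SNP.
    Returns (chi-squared statistic, P-value).
    """
    a, b = case_counts
    c, d = control_counts
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("every row and column of the table must be non-empty")
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Upper tail of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical marker: allele 1 on 60 of 100 case chromosomes
# and on 40 of 100 control chromosomes.
chi2, p = allele_association_test((60, 40), (40, 60))
print(chi2, round(p, 4))  # 8.0 0.0047
```

A significant result at the marker does not mean the marker itself is causal: it may simply be in LD with the unknown variant, which is exactly the point of the slide.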
Power to detect association improves when using haplotypes:
In summary: Knowing about LD and Haplotype Blocks empowers us to detect association between markers and disease and to perform many linkage-based disease studies.