Centro Nacional Genotipado. Bioinformatic analysis of gene and genome sequences and expression. Human SNPs. Theory and practicals. Arcadi Navarro. Madrid,

1 Centro Nacional Genotipado. Bioinformatic analysis of gene and genome sequences and expression. Human SNPs. Theory and practicals. Arcadi Navarro. Madrid, 6 April 2006

2 What is CeGen? CeGen is a technology platform, an initiative of GENOMA ESPAÑA, that aims to provide the knowledge and infrastructure needed to carry out large-scale, low-cost SNP (Single Nucleotide Polymorphism) genotyping projects. The initiative is aimed at research groups in universities, hospitals, research centres and industry. CeGen seeks to contribute to a qualitative and quantitative leap in research through high value-added services provided from Spain.

3 SNPs (Single Nucleotide Polymorphisms) Frequent. Well distributed. Stable. Functional? Allow large-scale processing.

4 What is a SNP (Single Nucleotide Polymorphism)? A SNP is a position in a genome at which two or more different bases occur in the population, each with a frequency >1%.
GATTTAGATCGCGATAGAG
GATTTAGATCTCGATAGAG
- SNPs are the most common type of variation (genetic marker).
- A SNP usually has only two variants: here G/T (C/A when read on the complementary strand).
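
As a small illustration of the definition above, the sketch below compares the two example sequences from the slide and reports where they differ. The function name `snp_positions` is hypothetical, and a difference between two sequences only counts as a SNP if both alleles exceed the 1% frequency threshold in the population, which this toy comparison cannot check.

```python
def snp_positions(seq_a, seq_b):
    """Return the 0-based positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned and of equal length")
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


# The two example sequences shown on the slide.
seq1 = "GATTTAGATCGCGATAGAG"
seq2 = "GATTTAGATCTCGATAGAG"

for pos in snp_positions(seq1, seq2):
    # Report the candidate SNP as position plus the two observed alleles.
    print(pos, seq1[pos], seq2[pos])
```

For the slide's sequences this reports a single candidate site with alleles G and T.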

5 Scientific interest
There is great demand for analysing SNPs in large numbers, and the available technologies are costly and highly varied.
– CeGen offers, from Spain, large-scale, low-cost genotyping services tailored to each need.
– CeGen offers scientific support and access to technologies for projects that would otherwise require far more time and resources.
The easier genotyping becomes, the more ambitious and the more numerous the projects that can be undertaken. E.g. in association studies that require genotyping many individuals (cases, controls):
– Possibility of a whole genome scan
– BioBank projects

6 Strategic interest
CeGen can help with applications for research projects that involve genotyping, at several levels:
– Proposing SNP selection strategies
– Offering the large-scale technologies that make the project viable
– Calculating budgets that can be included in the application

7 Back to SNPs Frequent. Well distributed. Stable. Functional? Allow large-scale processing.

8 Terminology
An allele is one of a number of alternative forms of the same gene occupying a given locus. For SNPs, an allele is one of two alternative forms.
A locus is the physical location of an allele on the chromosome.
A haplotype is a set of alleles that tend to be inherited together (not easily separable by recombination).
Example: Consider 2 loci, each with two possible alleles, the first locus being either A or a, the second locus being B or b. Then each chromosome of an individual carries one of 4 possible haplotypes: AB, Ab, aB, ab.

9 tttctccatttgtcgtgacacctttgttgacaccttcatttctgcattctcaattctatttcactggtctatgg cagagaacacaaaatatggccagtggcctaaatccagcctactaccttttttttttttttgtaacattttacta acatagccattcccatgtgtttccatgtgtctgggctgcttttgcactctaatggcagagttaagaaattgtag cagagaccacaatgcctcaaatatttactctacagccctttataaaaacagtgtgccaactcctgatttatgaa cttatcattatgtcaataccatactgtctttattactgtagttttataagtcatgacatcagataatgtaaatc ctccaactttgtttttaatcaaaagtgttttggccatcctagatatactttgtattgccacataaatttgaaga tcagcctgtcagtgtctacaaaatagcatgctaggattttgatagggattgtgtagaatctatagattaattag aggagaatgactatcttgacaatactgctgcccctctgtattcgtgggggattggttccacaacaacacccacc ccccactcggcaacccctgaaacccccacatcccccagcttttttcccctgctaccaaaatccatggatgctca agtccatataaaatgccatactatttgcatataacctctgcaatcctcccctatagtttagatcatctctagat tacttataatactaataaaatctaaatgctatgtaaatagttgctatactgtgttgagggttttttgttttgtt ttgttttatttgtttgtttgtttgtattttaagagatggtgtcttgctttgttgcccaggctggagtgcagtgg tgagatcatagcttactgcagcctcaaactcctggactcaaacagtcctcccacctcagcctcccaaagtgctg ggatacaggtgtgacccactgtgcccagttattattttttatttgtattattttactgttgtattatttttaat tattttttctgaatattttccatctatagttggttgaatcatggatgtggaacaggcaaatatggagggctaac tgtattgcatcttccagttcatgagtatgcagtctctctgtttatttaaagttttagtttttctcaaccatgtt tacttttcagtatacaagactttgacgttttttgttaaatgtatttgtaagtattttattatttgtgatgttat ttaaaaagaaattgttgactgggcacagtggctcacgcctgtaatcccagcactttgggaggctgaggcgggca gatcacgaggtcaggagatcaagaccatcctggctaacatggtaaaaccccgtctctactaaaaatagaaaaaa attagccaggcgtggtggcgagtgcctgtagtcccagctactcgggaggctgaggcaggagaatggtgtgaacc tgggaggcggagcttgcagtgagctgagatcgtgccactgcattccagcctgcgtgacagagcgagactctgtc aaaaaaataaataaaatttaaaaaaagaagaagaaattattttcttaatttcattttcaggttttttatttatt tctactatatggatacatgattgatttttgtatattgatcatgtatcctgcaaactagctaacatagtttatta tttctctttttttgtggattttaaaggattttctacatagataaataaacacacataaacagttttacttcttt cttttcaacctagactggatgcattttttgtttttgtttgtttgtttgctttttaacttgctgcagtgactaga gaatgtattgaagaatatattgttgaacaaaagcagtgagagtggacatccctgctttccccctgattttaggg 
ggaatgttttcagtctttcactatttaatatgattttagctataggtttatcctagatccctgttatcatgttg aggaaattcccttctatttctagtttgttgagattttttaattcatgtgattgcgctatctggctttgctctca

10

11

12 http://www.ncbi.nlm.nih.gov/About/primer/snps.html

13 What are SNPs used for? (I)
Genotyping in humans
– Searching for disease susceptibility genes
– Diagnosis / prognosis
– Drug metabolism
– Adverse drug reactions (personalised medicine)
– Response to environmental factors
– Forensic genetics
– Genome structure and dynamics
– Genome evolution

14 What are SNPs used for? (II)
Genotyping in other species
– Microbial identification
– Analysis of microbial communities
– Genotyping in yeast
– Use similar to that in humans, in mouse and other model organisms
– Heavy use in domestic species (QTL), both animal and plant
– Identification of plant varieties
Their use will expand with new SNP maps

15 SNPs in the Human Genome
All humans share 99.9% of the same genetic sequence
– SNPs occur about every 1000 base pairs
90% of human genome variation comes from SNPs
The human genome contains 10 million validated SNPs and 21 million submitted.
– ~340,000 SNPs are found in genes
SNPs are not evenly spaced along the sequence
– SNP-rich regions
– SNP-poor regions

16 What is a Haplotype? A haplotype is a sequence of alleles stretching along an extended segment of DNA, a sort of super allele! Haplotypes are usually inherited as a single unit from parents. [Figure: for the genotype Aa Bb, the haplotypes are either AB and ab, or Ab and aB]

17 Alleles of Adjacent SNPs on a Chromosome form Haplotypes a. Short stretch of DNA for 4 different people – 3 SNPs are present b. Haplotypes made up of a combination of different alleles at 20 nearby SNPs c. Genotyping just 3 “tag” SNPs can distinguish all 4 haplotypes
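
The idea in point (c) can be sketched as a brute-force search for the smallest set of SNP columns that still tells all haplotypes apart. The four five-SNP haplotypes below are invented for illustration (they are not the ones in the slide's figure), and `minimal_tag_snps` is a hypothetical helper; real tag-SNP selection works on LD between SNPs and scales far better than this exhaustive search.

```python
from itertools import combinations


def minimal_tag_snps(haplotypes):
    """Return the smallest tuple of SNP indices that distinguishes all haplotypes."""
    n_snps = len(haplotypes[0])
    n_distinct = len(set(haplotypes))
    for size in range(1, n_snps + 1):
        for cols in combinations(range(n_snps), size):
            # Keep only the chosen SNP columns and count distinct patterns.
            projected = {tuple(h[c] for c in cols) for h in haplotypes}
            if len(projected) == n_distinct:
                return cols
    return tuple(range(n_snps))


# Hypothetical haplotypes: 4 chromosomes typed at 5 biallelic SNPs.
example_haplotypes = ["ACGTA", "ACGCA", "GTGTA", "GTACG"]

tags = minimal_tag_snps(example_haplotypes)
print(tags)  # (0, 3): two tag SNPs are enough for these four haplotypes
```

Here SNPs 0 and 1 carry the same information, as do SNPs 2 and 4 for most chromosomes, which is why genotyping only two positions recovers the full haplotype.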

18 What is Linkage Disequilibrium? Linkage Disequilibrium (LD) is nonindependence (nonrandomness) of alleles at different sites (different SNPs for the rest of the session). Example: Suppose that allele A at locus 1 and allele B at locus 2 are at frequencies $p_A$ and $p_B$, respectively, in the population. If the two loci are independent, then we would expect to see the AB haplotype at frequency $p_A p_B$. If the population frequency of the AB haplotype is either higher or lower than this, implying that particular alleles tend to be observed together, then the two loci are said to be in LD.

19 Linkage “Equilibrium”
E.g. two adjacent SNPs (A and B) are genotyped in a population. There are 4 possible haplotypes: AB, Ab, aB, ab.
                SNP 1
                A       a
SNP 2    B      f_AB    f_aB    f_B
         b      f_Ab    f_ab    f_b
                f_A     f_a
Under linkage equilibrium we have what we expect:
$f_{AB} = f_A f_B$
$f_{aB} = f_a f_B$
$f_{Ab} = f_A f_b$
$f_{ab} = f_a f_b$

20 Linkage Disequilibrium (LD)
E.g. two adjacent SNPs are genotyped in a population. There are 4 possible haplotypes: AB, Ab, aB, ab.
                SNP 1
                A       a
SNP 2    B      f_AB    f_aB    f_B
         b      f_Ab    f_ab    f_b
                f_A     f_a
Under linkage disequilibrium we have different results:
$f_{AB} = f_A f_B + D$
$f_{aB} = f_a f_B - D$
$f_{Ab} = f_A f_b - D$
$f_{ab} = f_a f_b + D$
where D is the LD coefficient:
$D = f_{AB} \times f_{ab} - f_{aB} \times f_{Ab}$ or $D = f_{AB} - f_A \times f_B$

21 Linkage Disequilibrium (LD)
Linkage equilibrium
                SNP 1
                A       a
SNP 2    B      0.25    0.25    0.5
         b      0.25    0.25    0.5
                0.5     0.5
Linkage disequilibrium
                SNP 1
                A       a
SNP 2    B      0.5     0       0.5
         b      0       0.5     0.5
                0.5     0.5
where D is 0.25:
$D = f_{AB} \times f_{ab} - f_{aB} \times f_{Ab} = 0.5 \times 0.5 - 0 \times 0 = 0.25$
$D = f_{AB} - f_A \times f_B = 0.5 - 0.5 \times 0.5 = 0.25$

22 Linkage Disequilibrium (LD)
Linkage equilibrium
                SNP 1
                A       a
SNP 2    B      0.12    0.48    0.6
         b      0.08    0.32    0.4
                0.2     0.8
Linkage disequilibrium
                SNP 1
                A       a
SNP 2    B      0.02    0.58    0.6
         b      0.18    0.22    0.4
                0.2     0.8
where D is -0.10:
$D = f_{AB} \times f_{ab} - f_{aB} \times f_{Ab} = 0.02 \times 0.22 - 0.58 \times 0.18 = -0.10$
$D = f_{AB} - f_A \times f_B = 0.02 - 0.2 \times 0.6 = -0.10$
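
The two worked examples can be checked numerically. The sketch below computes D both ways from the four haplotype frequencies; the function name `ld_coefficient` and its argument order are assumptions made for this illustration.

```python
def ld_coefficient(f_AB, f_Ab, f_aB, f_ab):
    """Return D computed two ways: cross-product form and marginal form."""
    total = f_AB + f_Ab + f_aB + f_ab
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    f_A = f_AB + f_Ab  # frequency of allele A at SNP 1
    f_B = f_AB + f_aB  # frequency of allele B at SNP 2
    d_cross = f_AB * f_ab - f_aB * f_Ab
    d_marginal = f_AB - f_A * f_B
    return d_cross, d_marginal


# Slide 21, disequilibrium table: only AB and ab are present.
print(ld_coefficient(0.5, 0.0, 0.0, 0.5))

# Slide 22, disequilibrium table.
print(ld_coefficient(0.02, 0.18, 0.58, 0.22))

# Slide 22, equilibrium table: D should be (numerically) zero.
print(ld_coefficient(0.12, 0.08, 0.48, 0.32))
```

Both forms agree because, once the four frequencies sum to 1, they are algebraically the same quantity.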

23 Linkage Disequilibrium (LD)

24 Assessing LD 1.Measuring it with some parameter (we have just seen D) 2.Testing statistically whether it exists: random association or LD?

25 The Measure of LD The D coefficient depends on the marginal allele frequencies in the contingency table: the range of values it can take changes with those frequencies. This limits D as a measure of association, because its values cannot be compared directly across SNPs or populations. However, D can be normalised to D’, making it comparable across populations and SNPs.

26 The D’ measure of LD
$D' = D / D_{max}$
where $D_{max}$ is the maximum absolute value D can take given the allele frequencies:
$D_{max} = \min(f_A f_b, f_a f_B)$ when $D > 0$
$D_{max} = \min(f_A f_B, f_a f_b)$ when $D < 0$
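
A minimal sketch of this normalisation, applied to the disequilibrium tables of slides 21 and 22. The function name `d_prime` is hypothetical; it keeps the sign of D, so the result lies between -1 and +1.

```python
def d_prime(f_AB, f_A, f_B):
    """Signed D' from the AB haplotype frequency and the two allele frequencies."""
    f_a, f_b = 1.0 - f_A, 1.0 - f_B
    d = f_AB - f_A * f_B
    if d >= 0:
        d_max = min(f_A * f_b, f_a * f_B)
    else:
        d_max = min(f_A * f_B, f_a * f_b)
    if d_max == 0:
        # At least one SNP is monomorphic, so LD is undefined; report 0 here.
        return 0.0
    return d / d_max


# Slide 21: f_AB = 0.5, f_A = 0.5, f_B = 0.5 (only two haplotypes present).
print(d_prime(0.5, 0.5, 0.5))

# Slide 22: f_AB = 0.02, f_A = 0.2, f_B = 0.6, so D = -0.10 and D_max = 0.12.
print(d_prime(0.02, 0.2, 0.6))
```

In the second case D = -0.10 looks small in absolute terms, but it is about 83% of the most negative value those allele frequencies allow.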

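27 Example D’ Calculation
Hap Freqs: $f_{AB} = 0.765$, $f_{aB} = 0.235$, $f_{Ab} = 0.167$, $f_{ab} = 0.833$
Allele Freqs: $f_A = 0.52$, $f_a = 0.48$, $f_B = 0.59$, $f_b = 0.41$
$D = f_{AB} \times f_{ab} - f_{Ab} \times f_{aB}$, given on the slide as 0.025 (the four haplotype values as transcribed sum to 2 rather than 1, so they do not reproduce this figure)
Since D is positive (>0): $D_{max} = \min(0.52 \times 0.41, 0.48 \times 0.59) = 0.2132$
$D' = D / D_{max} = 0.025 / 0.2132$
Thus, $D' = 0.117$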

28 Interpretation of D’ coefficient
D’ is constrained between –1 and +1, where:
D’ = 1 (perfect positive LD between SNP alleles)
D’ = 0 (linkage equilibrium, or no association between SNP alleles)
D’ = -1 (perfect negative LD between SNP alleles)
D’ = 0.87 (strong positive LD between SNP alleles)
D’ = 0.12 (weak positive LD between SNP alleles)
Significance (P-value) for D’ is determined from the Chi-squared distribution

29 Interpretation of D’ coefficient
D’ = 0 means no LD
|D’| = 1 means complete LD
Careful: |D’| can be 1 when 3 haplotypes are present:
         A      a
B        AB     aB
b        Ab     0

30 LD Plots of Adjacent SNPs LD varies significantly across genomic regions

31 The $r^2$ measure of LD The disequilibrium coefficient $r^2$ (sometimes also denoted by $\Delta^2$) represents the statistical correlation between 2 sites. Consider two biallelic loci on the same chromosome, with alleles A and a at the first locus and with alleles B and b at the second locus. The allele frequencies will be written as $p_A$, $p_a$, $p_B$, and $p_b$, and the four haplotype frequencies will be written as $p_{AB}$, $p_{Ab}$, $p_{aB}$, and $p_{ab}$. Then:
$r^2 = \dfrac{(p_{AB} p_{ab} - p_{Ab} p_{aB})^2}{p_A p_a p_B p_b}$

32 [Figure: two 2×2 haplotype tables (A/a by B/b) with variables x and y] $r^2$ is related to D, of course: $r^2 = D^2 / (p_A p_a p_B p_b)$

33 $r^2$ vs. D’ Both measures are 1 in case of complete disequilibrium and 0 if there is no LD. But $r^2 = 1$ corresponds to the situation where only 2 haplotypes are present (out of a possible four), while D’ is less specific: |D’| = 1 can reflect either 2 or 3 haplotypes being present.
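
The contrast between the two measures is easy to see numerically. The sketch below uses the standard definition $r^2 = D^2 / (p_A p_a p_B p_b)$ and compares a two-haplotype population with a three-haplotype one; the three-haplotype frequencies are invented for illustration, and `ld_measures` is a hypothetical helper.

```python
def ld_measures(f_AB, f_Ab, f_aB, f_ab):
    """Return (|D'|, r^2) for two biallelic SNPs from haplotype frequencies."""
    f_A, f_a = f_AB + f_Ab, f_aB + f_ab
    f_B, f_b = f_AB + f_aB, f_Ab + f_ab
    denom = f_A * f_a * f_B * f_b
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic")
    d = f_AB * f_ab - f_Ab * f_aB
    r2 = d * d / denom
    d_max = min(f_A * f_b, f_a * f_B) if d > 0 else min(f_A * f_B, f_a * f_b)
    abs_d_prime = abs(d) / d_max if d_max > 0 else 0.0
    return abs_d_prime, r2


# Two haplotypes present (AB and ab): both measures equal 1.
print(ld_measures(0.5, 0.0, 0.0, 0.5))

# Three haplotypes present (ab is missing): |D'| is still 1, r^2 is only 1/9.
print(ld_measures(0.5, 0.25, 0.25, 0.0))
```

In the second case one SNP cannot stand in for the other even though |D’| = 1, which is why $r^2$ is usually the more relevant measure when choosing tag SNPs.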

34 The representation of LD [Plot: linkage disequilibrium (D') from 0 to 1 on the vertical axis against distance between SNPs (base pairs) on the horizontal axis, with bins at 5 kb, 10 kb, 20 kb, 40 kb, 80 kb and 160 kb] This is just Excel, say.

35 The representation of LD

36 This is HelixTree software screen shot: D’= 0 is blue, D’ = 1 is red.

37 The representation of LD The Graphic Overview of Linkage Disequilibrium (GOLD) software package http://www.sph.umich.edu/csg/abecasis/GOLD/ [Figure: GOLD LD map, with labels marking regions of high recombination and of high LD]

38 The representation of LD Haploview [Figure: Haploview LD plot, with labels marking regions of high recombination and of high LD]

39 Testing for LD
LD tests: $\chi^2$, Fisher’s exact test
LD for k alleles: $D_m$, Fisher’s exact test, $\chi^2$
LD for n loci: $\xi$, FNF
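
For the simplest case, two biallelic SNPs, the chi-squared test amounts to a Pearson test on the 2×2 table of haplotype counts. The sketch below is one way to write it with the standard library only; the haplotype counts are invented (100 chromosomes at the frequencies of slide 22), the function name is hypothetical, and with small expected counts Fisher's exact test would be the safer choice.

```python
import math


def chi_square_ld(n_AB, n_Ab, n_aB, n_ab):
    """Pearson chi-squared test (1 degree of freedom) on 2x2 haplotype counts."""
    n = n_AB + n_Ab + n_aB + n_ab
    n_A, n_a = n_AB + n_Ab, n_aB + n_ab
    n_B, n_b = n_AB + n_aB, n_Ab + n_ab
    denom = n_A * n_a * n_B * n_b
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")
    chi2 = n * (n_AB * n_ab - n_Ab * n_aB) ** 2 / denom
    # Upper tail of a chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical sample of 100 chromosomes in LD (frequencies of slide 22).
print(chi_square_ld(2, 18, 58, 22))

# Hypothetical sample of 100 chromosomes at linkage equilibrium.
print(chi_square_ld(12, 8, 48, 32))
```

For a 2×2 table this statistic equals the number of chromosomes times $r^2$, so the same strength of LD becomes more significant as the sample grows.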

40

41 where:
$\nu$ is the number of degrees of freedom
n is the sample size
t is the observed likelihood ratio (LR)
$\mu$ is the average LR in a permuted distribution
$\sigma$ is the standard deviation of the permuted LR distribution
Zhao et al., Ann. Hum. Genet., 63:167-179, 1999

42 FNF: fraction not found. Based on the fact that LD reduces the number of haplotypes.
$K_e$: expected number of haplotypes under linkage equilibrium, given the allele frequencies and the sample size
$K_o$: observed number of haplotypes
$K_{min}$: minimum possible number of haplotypes
Slatkin, Genetics 154:1367-1378, 2000
Mateu et al., Am. J. Hum. Genet. 68:103-117, 2001

43 OK, fine… But, why is there any LD at all in the genome?

44 The origins of Linkage Disequilibrium [Figure: variations in chromosomes within a population. A time axis runs from a common ancestor to the present and shows the emergence of variations, including a disease mutation (assume no recombination)]

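45 So LD is the basis of, for example, association studies (you’ll see more about this later...). And we can go even deeper: LD decays with recombination
$D_{t+1} = (1 - \theta) D_t$
where $\theta$ is the recombination fraction between the two loci and t counts generations.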
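
Iterating this recurrence gives a geometric decay, $D_t = (1 - \theta)^t D_0$. The sketch below evaluates it and the number of generations needed for D to fall to a given fraction of its starting value; the symbol `theta` for the recombination fraction and the function names are assumptions of this example.

```python
import math


def ld_after(d0, theta, generations):
    """LD coefficient after a number of generations of random mating."""
    return d0 * (1.0 - theta) ** generations


def generations_to_fraction(theta, fraction=0.5):
    """Generations needed for D to decay to `fraction` of its initial value."""
    if not 0.0 < theta < 1.0:
        raise ValueError("theta must lie strictly between 0 and 1")
    return math.log(fraction) / math.log(1.0 - theta)


# Unlinked loci (theta = 0.5): D halves every generation.
print(ld_after(0.25, 0.5, 1))

# Loci 1 cM apart (theta = 0.01): roughly 69 generations to halve D.
print(generations_to_fraction(0.01))
```

Tightly linked SNPs therefore keep their LD for many generations, which is what makes old haplotypes detectable today.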

46 LD is a function of distance
Distances:
1) Physical distance between loci is measured in base pairs.
2) Genetic distance is based on the probability of recombination; its unit is the Morgan.
- A distance of 1 centiMorgan (cM) between two loci means that they have a 1% chance of being separated by recombination in one generation.
- In humans, a genetic distance of 1 cM is roughly equal to a physical distance of 1 million base pairs (1 Mbp).
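
Combining the 1 cM per Mbp rule of thumb with the decay of LD per generation gives a rough feel for how far LD should extend. The sketch below assumes a uniform recombination rate and treats the recombination fraction as cM/100, which is only reasonable for short distances; the function names and the 2,000-generation horizon are choices made for this illustration.

```python
def recombination_fraction(distance_bp, cm_per_mb=1.0):
    """Approximate recombination fraction for a short physical distance."""
    centimorgans = distance_bp / 1_000_000 * cm_per_mb
    return centimorgans / 100.0


def fraction_of_ld_remaining(distance_bp, generations, cm_per_mb=1.0):
    """Fraction of the initial D left after the given number of generations."""
    theta = recombination_fraction(distance_bp, cm_per_mb)
    return (1.0 - theta) ** generations


# How much LD survives 2,000 generations at increasing distances?
for distance in (5_000, 10_000, 20_000, 40_000, 80_000, 160_000):
    remaining = fraction_of_ld_remaining(distance, 2_000)
    print(f"{distance // 1000} kb: {remaining:.3f}")
```

Under these assumptions most LD is retained at a few kb and little is left beyond about 100 kb, though real recombination rates vary strongly along the genome.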

47 LD decays with recombination [Figure: timeline with panels at 2,000 gens. ago (disease-causing mutation), 1,000 gens. ago, and the present]

48 (Think Finland: 1000 founders 2000 years ago; consistent expansion) Few (maybe none) reoccurrences of disease-causing mutation (Think Earth: 10,000 "founders" (N e ); 100,000 years ago) Assume old mutations cause common diseases

49 Wait a sec, these are the haplotypes, right? [Figure: variations in chromosomes within a population. A time axis runs from a common ancestor to the present and shows the emergence of variations, including a disease mutation (assume no recombination)]

50 And remember one can select tag-SNPs… a. Short stretch of DNA for 4 different people – 3 SNPs are present b. Haplotypes made up of a combination of different alleles at 20 nearby SNPs c. Genotyping just 3 “tag” SNPs can distinguish all 4 haplotypes

51 Cool! Nowadays, we can massively genotype individuals. We could potentially “cover” the whole genome using the property of LD and a few tag-SNPs... But how many SNPs does it take to tag the whole genome? And can we easily ascertain individual haplotypes?

52 Haplotyping: Phase Problem
Observed (diploid genotype): SNP1 G/T, SNP2 A/C
Possible haplotypes: GA, TC or GC, TA
n SNPs → $2^n$ possible haplotypes
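
The ambiguity can be made concrete by listing every pair of haplotypes that is compatible with an unphased genotype. In the sketch below a genotype is a list of unordered allele pairs, one per SNP; the function name `compatible_haplotype_pairs` is hypothetical. An individual who is heterozygous at k SNPs has $2^{k-1}$ compatible pairs.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """List the unordered haplotype pairs consistent with an unphased genotype."""
    het_sites = [i for i, (x, y) in enumerate(genotype) if x != y]
    pairs = set()
    for choice in product((0, 1), repeat=len(het_sites)):
        hap1 = [alleles[0] for alleles in genotype]
        hap2 = [alleles[1] for alleles in genotype]
        for site, flip in zip(het_sites, choice):
            if flip:
                # Swap which chromosome carries which allele at this SNP.
                hap1[site], hap2[site] = hap2[site], hap1[site]
        pairs.add(tuple(sorted(("".join(hap1), "".join(hap2)))))
    return sorted(pairs)


# The slide's example: SNP1 is G/T and SNP2 is A/C.
print(compatible_haplotype_pairs([("G", "T"), ("A", "C")]))
# [('GA', 'TC'), ('GC', 'TA')]
```

Homozygous sites add no ambiguity, so only the heterozygous SNPs drive the exponential growth in possible phasings.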

53 The Problem
It’s not yet easy to measure an individual’s (only two) haplotypes
Molecular haplotyping (nucleotide sequencing) is the gold standard
A more efficient strategy:
– Focus on regions, such as certain genes
– Estimate haplotypes from SNP data (genotypes)
– Use an LD map, and reduce the number of loci needed to represent the haplotype
– Use a haplotype map (DB) = key SNPs + haplotype blocks with strong LD

54 Molecular Haplotyping
Hetero-duplex analysis, mismatch detection, allele-specific PCR:
- Have potential to get high-throughput
- Only practical for short haplotypes (2-5 kb vs. 50-100 kb)
- Costly
Rolling Circle amplification method, etc.:
- Can handle larger size
- Difficult to automate

55 In-silico Haplotyping
Alias: Haplotype Reconstruction, Haplotype Inference, Computational Haplotyping, Statistical Haplotyping, etc.
Advantages:
- Cost effective
- High-throughput
Difficulty:
- Phase ambiguity: the number of possible haplotypes increases exponentially with the number of SNPs

56 In-silico Haplotyping: Two Tasks I.Reconstruction of the haplotypes of the sampled individuals. II. Estimation of haplotypes frequencies in a population.

57 In-silico Haplotyping: Approaches 1) Clark’s algorithm 2) E-M algorithm (expectation-maximization algorithm) 3) Bayesian algorithm Message: many different approaches
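
To give a flavour of approach 2, here is a minimal E-M sketch for the simplest case of two biallelic SNPs. It is an illustration under stated assumptions, not the algorithm of any particular program: genotypes are coded as (copies of A, copies of B), random mating is assumed, and the only ambiguous individuals are the double heterozygotes (1, 1), who carry either AB/ab or Ab/aB. The function name and the toy data are hypothetical.

```python
def em_two_snp(genotypes, max_iter=100, tol=1e-10):
    """Estimate AB, Ab, aB, ab haplotype frequencies from unphased genotypes."""
    if not genotypes:
        raise ValueError("at least one genotype is required")
    n_chromosomes = 2 * len(genotypes)
    base = {"AB": 0.0, "Ab": 0.0, "aB": 0.0, "ab": 0.0}
    n_double_het = 0
    for copies_a, copies_b in genotypes:
        if copies_a == 1 and copies_b == 1:
            n_double_het += 1  # phase unknown, resolved in the E-step
            continue
        first = ["A"] * copies_a + ["a"] * (2 - copies_a)
        second = ["B"] * copies_b + ["b"] * (2 - copies_b)
        # At most one SNP is heterozygous here, so the pairing is unambiguous.
        for x, y in zip(first, second):
            base[x + y] += 1

    freqs = {h: 0.25 for h in base}
    for _ in range(max_iter):
        # E-step: probability that a double heterozygote is AB/ab, not Ab/aB.
        cis = freqs["AB"] * freqs["ab"]
        trans = freqs["Ab"] * freqs["aB"]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: add the expected counts and re-estimate the frequencies.
        counts = dict(base)
        for h in ("AB", "ab"):
            counts[h] += n_double_het * w
        for h in ("Ab", "aB"):
            counts[h] += n_double_het * (1.0 - w)
        new_freqs = {h: c / n_chromosomes for h, c in counts.items()}
        change = max(abs(new_freqs[h] - freqs[h]) for h in freqs)
        freqs = new_freqs
        if change < tol:
            break
    return freqs


# Hypothetical sample: 4 AABB, 4 aabb and 2 double heterozygotes.
toy_genotypes = [(2, 2)] * 4 + [(0, 0)] * 4 + [(1, 1)] * 2
print(em_two_snp(toy_genotypes))
```

Because the unambiguous individuals only ever show AB and ab, the algorithm quickly concludes that the double heterozygotes are almost certainly AB/ab as well. With more SNPs the same idea applies, but the number of candidate haplotype pairs per individual grows quickly.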

58 How Far Does Association (LD) Extend Between Neighboring Common Sites? Theoretical (given 1 cM/Mb): 3-8 kb but… [Figure: distance axis from 0 kb to 160 kb (5, 10, 20, 40, 80, 160 kb) with a marked range of uncertainty]

59 Strategy for Assessing Extent of LD
- 19 regions
- 44 Caucasian samples from Utah
- a great deal of DNA sequencing per sample
[Figure: distance from the core single nucleotide polymorphism (SNP), on an axis from 0 kb to 160 kb (5, 10, 20, 40, 80, 160 kb)]

60 How far does the signal reach? Results:

61 LD and population genomics MYSTERY: What explains the long-range LD? → Maybe an important event in population history?

62 Positive Control: 48 Swedes Identical pattern to Utah

63 96 Nigerians (Yoruba) Much Less LD Associations in Africans a SUBSET of those in Caucasians MUST be influenced by population history

64 Confirmation of less LD in Africans from Direct DNA Sequencing

65 More evidence from Genotyping ~5,000 SNPs (Gabriel et al. 2002)

66 Explanation: Bottleneck or ‘Founder Effect’ in History of North Europeans What was this event? (1) Out of Africa? (2) Founding of Europe? [Figure labels: Ancestral Population; Yoruba Ancestors; North Europeans; Likely <10 founding chromosomes; ~100,000 years ago]

67 Given the demographic properties of LD, which populations are best suited for association-based mapping studies?
- LD reflects the ages of haplotypes in populations.
- A more recently founded population is useful for detecting long-range associations between disease-causing mutations and marker SNPs.
- Older populations are useful for fine-scale mapping.
- But things are always more complex…

68 How far does the signal reach? Results:

69 LD varies substantially across the genome! Maybe clearer this way:

70 LD and population genomics MYSTERY: What explains the huge genomic variance in LD distribution? → Maybe a lot of intra-genomic diversity. Maybe haplotype blocks?

71 ...it is not simple even within a population: a patterned structure of recombination in the genome can create blocks of LD

72 Haplotype Blocks
The human genome may be described as a series of regions of high LD called haplotype blocks.
These are separated by smaller regions of low LD, usually attributed to recombination hotspots.
A haplotype block consists of a few common haplotypes that account for a large DNA segment.

73 Haplotype Blocks
- Each row represents a SNP
- Blue dot = major allele, yellow = minor allele
- Each column represents a single chromosome
- The 147 SNPs are divided into 18 blocks defined by black lines.
- The expanded box on the right is a SNP block of 26 SNPs over 19 kb of genomic DNA. The 4 most common of 7 different haplotypes include 80% of the chromosomes, and can be distinguished with 2 SNPs
[Figure: SNPs (rows) by chromosomes (columns), with one haplotype block expanded]

74 These would be blocks: [Figure: LD map, with labels marking regions of high recombination and of high LD]

75 ... Likely to be caused by recombination hotspots (but things are not that easy)

76 So what we need is a haplotype map The Haplotype Map, HapMap, will be a map of these haplotype blocks and the SNPs that identify the haplotypes. The HapMap will be a key resource for finding genes that contribute to disease risk and drug response.

77 What is the HapMap? The HapMap is a catalogue of common genetic variants (SNPs) that occur in humans.
What information does the HapMap provide?
- the characteristics of the SNPs (sequence variation, allele freqs)
- where they occur in the genome (relative positions)
- how they are distributed in human populations (LD and haplotype blocks)

78 Aims of the HapMap Project
To develop a map of the human genome that describes the common patterns of DNA sequence variation (haplotypes), for use in establishing connections between genetic variants and disease.
Populations sampled (n = 270 people)
– African
– Asian
– European
614030 SNPs genotyped (55 million genotypes)

79 The Construction of the HapMap
Three main steps:
1. SNPs are identified in different individuals from different ethnic groups
2. Adjacent SNPs that are inherited together are compiled into haplotypes
3. SNPs that uniquely represent haplotypes, i.e. tagSNPs, are identified for use in genetic association studies of disease

80

81 Nicotine metabolising gene

82 HapMap Genotype Data Analysis The raw genotype data can be downloaded from the HapMap website HAPLOVIEW is a useful tool for conducting genotype analysis

83

84 You’ll see a lot about this later. Let me just tell you a couple of things: LD and disease

85 Gene mapping by linkage in a dominant Mendelian disease: only recombination events in the families carry information to narrow down the location of the gene

86 In LD mapping, all the recombination events in the history of the disease are used to find the gene region

87 LD and complex diseases: LD between a marker and the (unknown) genetic variant contributing to the disease underlies the association approach (i.e., comparing allele frequencies between cases and controls)

88 Power to detect association improves when using haplotypes:

89 In summary: Knowing about LD and Haplotype Blocks empowers us to detect association between markers and disease and to perform many linkage-based disease studies.

