Genetic Analysis.

Presentation on theme: "Genetic Analysis."— Presentation transcript:

1 Genetic Analysis

2 Genotype / Phenotype
We are interested in finding genotypes that explain or predict phenotypes, the manifestations of heritable characteristics. Disease (and susceptibility to it) is a key phenotype. Genotype is generally hard to assay, so we are interested in finding genetic markers. Genetic markers are common characters, inherited in sufficiently simple way for us to follow. [Race, 1965]

3 Linkage Analysis
Genes near each other on a chromosome tend to be inherited together; that is, they are linked. Linkage analysis is the set of techniques used to identify such linkages among genes. Linkage groups that include both genetic markers and genes determinative of phenotype allow the identification of determinative alleles (and therefore prediction).

4 Finding genes for human disease
Positional cloning: find a location in the genome that is inherited with a disease. Basic concepts: Polymorphism: a region of the genome that varies across individuals. SNP: single nucleotide polymorphism (say “snip”). Haplotype: a set of closely linked polymorphisms (usually SNPs) that are inherited as a unit. Linkage analysis: use “markers” (genetic regions with widely distributed polymorphisms) to identify chromosomal regions that are inherited (“cosegregate”) with the disease.

5 Mendelian Inheritance
Each sexually reproducing organism has two alleles for each gene, one from each parent. If the two alleles are the same, the phenotype reflects it. These organisms are called homozygotic for that allele. If the two alleles are different, the phenotype reflects the dominant allele. These organisms are heterozygotic. The allele that is not dominant is called recessive. Recessive alleles are reflected in the phenotype only when they are homozygotic.

6 Linkage
Mendel showed that alleles segregate independently. Then he tested pairs of genes. Sometimes the inheritance of two genes is independent of one another, that is, phenotype ratios are 9:3:3:1. Sometimes the inheritance of two genes is linked, showing a ratio of 3:0:0:1. Linkage can vary continuously from perfectly correlated to uncorrelated.
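The two ratios quoted on this slide can be checked by brute-force enumeration. The sketch below is only an illustration: the function name `phenotype_counts` and the two-letter gamete encoding are assumptions made here, not part of the lecture, and it assumes complete dominance at both genes and identical gamete sets from both parents.

```python
from itertools import product

def phenotype_counts(gametes):
    """Count offspring phenotypes when both parents produce the given gametes.

    Each gamete is a two-letter string such as "Ab"; uppercase is dominant.
    """
    counts = {"AB": 0, "Ab": 0, "aB": 0, "ab": 0}
    for g1, g2 in product(gametes, repeat=2):
        # A dominant phenotype shows if either gamete carries the uppercase allele.
        first = "A" if "A" in (g1[0], g2[0]) else "a"
        second = "B" if "B" in (g1[1], g2[1]) else "b"
        counts[first + second] += 1
    return counts

# Unlinked genes: an AaBb parent makes all four gametes equally often.
print(phenotype_counts(["AB", "Ab", "aB", "ab"]))  # {'AB': 9, 'Ab': 3, 'aB': 3, 'ab': 1}
# Perfect linkage: an AB/ab parent only ever passes AB or ab.
print(phenotype_counts(["AB", "ab"]))              # {'AB': 3, 'Ab': 0, 'aB': 0, 'ab': 1}
```

Intermediate degrees of linkage would correspond to unequal gamete frequencies, which this equal-weight enumeration does not model.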

7 Why genes are linked
Alleles are arranged linearly. Each parent passes only one of its two chromosomes to an offspring. Recombination periodically switches which chromosome in the parent is passed along. Alleles near each other are more likely to be passed along together than ones further apart. Alleles on different chromosomes are always inherited independently.

8 Recombination picture
Crossover is the alternation of which chromatid (half of a chromosome) supplies the alleles.

9 Finding disease genes
Create collections of genetic markers (easily detectable polymorphisms with known chromosomal position), i.e. genome maps. Assemble families with affected members. Look for patterns of markers that correlate with disease incidence. Place the disease gene in an ordering relationship with the markers. If the markers are close together, we know where the gene is (or is likely to be).

10 Linkage analysis process
Find sets of related patients. Assemble patients into pedigrees (family genealogies). Genotype patients and other family members for polymorphic markers. Look for markers that are inherited with the disease; these are said to cosegregate. Identify the chromosomal region(s) where they are (or are not) located.

11 Pedigrees A pedigree is a tabulation of the appearance of a trait (disease or marker) in an extended family, along with the family's mating history.

12 Some assumptions
Start by looking for a single gene with Mendelian inheritance, but: quantitative traits are polygenic; most widespread inherited diseases are polygenic; penetrance can be incomplete. Kinds of pedigrees: large, extended families (but few of them) are best for single-gene, Mendelian inheritance; sib-pairs (but a lot of them) are best for complex traits and incomplete penetrance.

13 Recombination
Individual $i$ is recombinant with respect to a parent $p$ if $i$ inherits the A allele from one chromatid of $p$ and the B allele from the other. Recombination fraction $\theta_{AB}$ = probability that a child is recombinant. Linkage between A and B $\Leftrightarrow$ low $\theta_{AB}$.

14 LOD Scores (simple model)
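Let $R$ = number of recombinant offspring and $P$ = total offspring. Model: likelihood of the data $L(\theta) = \theta^{R}(1-\theta)^{P-R}$. Maximum likelihood estimate: $\hat{\theta} = R/P$. LOD score for linkage in a pedigree is $\mathrm{LOD} = \log_{10}\left[ L(\hat{\theta}) / L(\tfrac{1}{2}) \right]$.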
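As a worked illustration of the simple model on this slide, the sketch below computes the maximum likelihood recombination fraction and the LOD score, taken here as the usual log10 ratio of the likelihood at the estimate to the likelihood at a recombination fraction of 0.5 (no linkage). The function names are hypothetical, and capping the estimate at 0.5 is an assumption added here, since a recombination fraction above 0.5 has no biological meaning.

```python
import math

def log10_likelihood(theta, recombinants, total):
    """log10 of theta^R * (1 - theta)^(P - R)."""
    non_recombinants = total - recombinants
    # Skip terms with a zero exponent so theta = 0 with R = 0 is well defined.
    term_r = recombinants * math.log10(theta) if recombinants else 0.0
    term_n = non_recombinants * math.log10(1 - theta) if non_recombinants else 0.0
    return term_r + term_n

def lod_score(recombinants, total):
    """Return (theta_hat, LOD) for R recombinants among P offspring."""
    theta_hat = min(recombinants / total, 0.5)  # assumption: cap the MLE at 0.5
    lod = (log10_likelihood(theta_hat, recombinants, total)
           - log10_likelihood(0.5, recombinants, total))
    return theta_hat, lod

print(lod_score(0, 10))  # no recombinants in 10 offspring: LOD just above 3
print(lod_score(1, 10))  # one recombinant: LOD about 1.6
print(lod_score(5, 10))  # half recombinant: LOD of 0, no evidence of linkage
```

Ten fully informative offspring with no recombinants is the smallest sample that reaches the conventional threshold of 3 under this model.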

15 Two point parametric linkage analysis
Two point because we look at the disease versus a marker (one marker at a time). Parametric because we are making a distributional assumption to get the likelihood of the data given $\theta$, $R$ and $P$. For a set of markers covering a whole genome at moderate spacing, calculate LOD scores. LODs > 3 (odds of 1000:1) give a general sense of where the disease gene is likely to be (> 5 is very strong).

16 Some complications
Which loci are recombinant? Need to know genotypes of parents and phase. Other parameterized models of non-Mendelian inheritance. Non-parametric models. Linkage disequilibrium.

17 Phase
If disease gene D/d is linked to marker A/a, the disease could segregate either with A or with a. We can't tell who is recombinant and who isn't unless we know whether the mother or father donated the allele. In the example below, R = recombinant and N = non-recombinant.
Phase I (D|a): d|a (R), D|a (N), D|a (N), d|A (N), d|A (N)
Phase II (D|A): d|a (N), D|a (R), D|a (R), d|A (R), d|A (R)
Can average over phases (losing power).
[Figure: pedigree with marker genotypes Aa and AA]
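The loss of power from averaging over phases can be seen numerically. This is a minimal sketch under the simple binomial model for recombinants, assuming the two phases are equally likely a priori and that switching phase turns every recombinant into a non-recombinant (as in the five-child example above). The function names are made up for illustration.

```python
import math

def phase_known_likelihood(theta, recombinants, total):
    """theta^R * (1 - theta)^(P - R) for a known phase."""
    return theta ** recombinants * (1 - theta) ** (total - recombinants)

def phase_averaged_lod(theta, recombinants_phase1, total):
    """LOD at theta when the two phases are weighted 1/2 each (an assumption)."""
    r1 = recombinants_phase1
    r2 = total - r1  # under the other phase, R and N labels swap
    averaged = 0.5 * (phase_known_likelihood(theta, r1, total)
                      + phase_known_likelihood(theta, r2, total))
    null = phase_known_likelihood(0.5, r1, total)  # identical under either phase
    return math.log10(averaged / null)

# Slide example: 1 recombinant of 5 under Phase I, 4 of 5 under Phase II.
known = math.log10(phase_known_likelihood(0.2, 1, 5) / phase_known_likelihood(0.5, 1, 5))
print(known)                          # about 0.42 if Phase I were known
print(phase_averaged_lod(0.2, 1, 5))  # about 0.12 when phase is unknown
```

The same data support linkage far less strongly once the phase has to be averaged out, which is why an extra relative who fixes the phase can be so valuable.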

18 Informativeness of family data
Sometimes the addition of one more family member can dramatically improve results, e.g. by allowing inference of phase. Sometimes an additional family member doesn't provide any new information, e.g. a second parent when both parents are homozygous for the marker allele. Can estimate the expected informativeness of various kinds of additional data...

19 Realistic models
Most genetics is not Mendelian. The relationship between genotype and phenotype is complex. Most general form for the likelihood function:
$$L(\theta) = \sum_{G_1} \cdots \sum_{G_n} \; \prod_{i} \mathrm{Penetrance}(X_i \mid G_i) \prod_{j \in \text{founders}} \mathrm{Prior}(G_j) \prod_{(k,l,m)} \mathrm{Transmit}(G_m \mid G_k, G_l)$$
where Penetrance(X|G) is the likelihood of observing the trait X given the genotype G, Prior(G) is the probability of observing the genotype spontaneously in member j, and Transmit(Gm|Gk,Gl) is the probability that an offspring will have genotype Gm given that the parents had genotypes Gk and Gl, parameterized by $\theta$ (and a mapping function). Easy to write, hard to compute the MLE.
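To make the three ingredients concrete, here is a deliberately small sketch that sums over all genotype assignments for one father, mother and child at a single disease locus. It is not the full linkage likelihood: there is no marker and no recombination fraction, founder priors are assumed to follow Hardy-Weinberg proportions, and all names (`trio_likelihood`, `PASS_D`, and so on) are hypothetical.

```python
from itertools import product

GENOTYPES = ("DD", "Dd", "dd")
PASS_D = {"DD": 1.0, "Dd": 0.5, "dd": 0.0}  # chance a parent transmits allele D

def prior(genotype, freq_d):
    """Founder genotype probability, assuming Hardy-Weinberg proportions."""
    p, q = freq_d, 1.0 - freq_d
    return {"DD": p * p, "Dd": 2 * p * q, "dd": q * q}[genotype]

def transmit(child, father, mother):
    """Mendelian probability of the child's genotype given both parents."""
    f, m = PASS_D[father], PASS_D[mother]
    return {"DD": f * m, "Dd": f * (1 - m) + (1 - f) * m, "dd": (1 - f) * (1 - m)}[child]

def penetrance_term(affected, genotype, penetrance):
    """P(observed status | genotype) from a penetrance table."""
    f = penetrance[genotype]
    return f if affected else 1.0 - f

def trio_likelihood(affected, penetrance, freq_d):
    """Sum over every genotype assignment for (father, mother, child)."""
    father_aff, mother_aff, child_aff = affected
    total = 0.0
    for gf, gm, gc in product(GENOTYPES, repeat=3):
        total += (prior(gf, freq_d) * prior(gm, freq_d)
                  * transmit(gc, gf, gm)
                  * penetrance_term(father_aff, gf, penetrance)
                  * penetrance_term(mother_aff, gm, penetrance)
                  * penetrance_term(child_aff, gc, penetrance))
    return total

# Fully penetrant recessive disease, allele frequency 0.1:
recessive = {"DD": 1.0, "Dd": 0.0, "dd": 0.0}
print(trio_likelihood((False, False, True), recessive, 0.1))  # about 0.0081
```

Even this toy case loops over 27 genotype combinations for three people; the sum grows exponentially with pedigree size, which is the "hard to compute" part.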

20 LOD graph
Can look at LOD score over a range of $\theta$'s, not just the MLE. Usual assumption is LOD > 3 is evidence for linkage, LOD < -2 is evidence for exclusion.
[Figure: LOD score (vertical axis, -3 to 3) plotted against $\theta$ (horizontal axis, 0.1 to 0.5)]

21 Recombination probability versus genetic distance
Recombination does not depend linearly on genetic distance between two loci. Over long enough distances, we get multiple recombination events. Transmission probability is proportional to genetic distance, not recombination ratio. We can estimate genetic distance from recombination ratios. Genetic distance is necessary for positional cloning.

22 Genetic Distance $d_{AB}$
The expected number of crossovers between locus A and locus B. Only defined for loci on the same chromosome. Genetic distance is measured in Morgans. It is additive: $d_{AB} + d_{BC} = d_{AC}$. Total number of crossovers per genome is twice the number per gamete...

23 From recombination to distance
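Differences between the recombination fraction $\theta_{AB}$ and the genetic distance $d_{AB}$: $\theta_{AB} + \theta_{BC} \neq \theta_{AC}$. $d_{AB}$ can be > 1. $\theta_{AB} = 0.5$ when A and B are on different chromosomes. $\theta_{AB} < d_{AB}$ (assuming same chromosome). For small values, $\theta_{AB} = \tfrac{1}{2}\,p(N_{AB} > 0) \approx d_{AB}$, with $N_{AB}$ the number of crossovers between A and B. Want an invertible mapping function $f(\theta_{AB}) = d_{AB}$, used to calculate $\theta_{AC} = f^{-1}(f(\theta_{AB}) + f(\theta_{BC}))$.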

24 Some mapping functions
Haldane's mapping function: $d_{AB} = -\tfrac{1}{2}\ln(1 - 2\theta_{AB})$, where $d_{AB}$ is in Morgans (multiply by 100 for cM). Assumes crossovers are random and independent. Kosambi's mapping function: $d_{AB} = \tfrac{1}{4}\ln\left[(1 + 2\theta_{AB})/(1 - 2\theta_{AB})\right]$, with $d_{AB}$ again in Morgans. Models interference: crossovers cannot happen too close to each other. Most popular.
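The two mapping functions and their inverses are short enough to write out, which also shows how a mapping function is used to combine adjacent recombination fractions. This is a sketch with hypothetical function names; distances are in Morgans and recombination fractions must lie in the range 0 to just under 0.5.

```python
import math

def haldane(theta):
    """Recombination fraction -> genetic distance (Morgans), no interference."""
    return -0.5 * math.log(1 - 2 * theta)

def haldane_inverse(distance):
    return 0.5 * (1 - math.exp(-2 * distance))

def kosambi(theta):
    """Recombination fraction -> genetic distance (Morgans), with interference."""
    return 0.25 * math.log((1 + 2 * theta) / (1 - 2 * theta))

def kosambi_inverse(distance):
    return 0.5 * math.tanh(2 * distance)

def combine(theta_ab, theta_bc, f, f_inverse):
    """theta_AC for loci in order A, B, C: add distances, then map back."""
    return f_inverse(f(theta_ab) + f(theta_bc))

print(combine(0.1, 0.1, haldane, haldane_inverse))  # about 0.18
print(combine(0.1, 0.1, kosambi, kosambi_inverse))  # about 0.192
```

Neither answer is 0.2: recombination fractions are not additive, while the distances they map to are. Kosambi's result is closer to the naive sum because interference makes double crossovers in a short interval rarer.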

25 From genetic distance to physical distance
Genetic distance does not have a simple mapping to physical (sequence) distance. Males and females have different numbers of crossovers per gamete. Men: 28.51 M over the whole genome. Women: 42.96 M (excluding X). Dividing 3 Gb/genome by these map lengths gives an average of about 1.05 Mb/cM for men and about 0.70 Mb/cM for women. There are also individual and genomic-region differences. Drosophila are about 400 kb/cM.

26 Penetrance
Not everyone with the disease genotype will have the disease phenotype (e.g. be sick): delayed (e.g. adult) onset; mild (even undetectable) symptoms; interaction with environmental stimuli; there can be random factors as well. Disease allele $\Rightarrow$ increased probability of disease. But then we don't know who in a pedigree has the disease genotype!

27 Penetrance model
Define penetrance with three parameters: $f_{DD} = p(\text{disease} \mid DD)$, $f_{Dd} = p(\text{disease} \mid Dd)$, $f_{dd} = p(\text{disease} \mid dd)$. $f_{DD} = f_{Dd} = 1$, $f_{dd} = 0$ is Mendelian dominant. $f_{DD} = 1$, $f_{Dd} = f_{dd} = 0$ is Mendelian recessive. $f_{dd} > 0$ means spontaneous mutations. $f_{DD} < 1$ means incomplete penetrance.
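With incomplete penetrance, a person's genotype can only be inferred probabilistically from their disease status. The sketch below applies Bayes' rule to the three-parameter penetrance model; the Hardy-Weinberg genotype prior, the allele frequency and the function name are assumptions introduced for illustration.

```python
def genotype_posterior(penetrance, freq_d, affected=True):
    """P(genotype | disease status) under Hardy-Weinberg priors (an assumption).

    penetrance maps "DD", "Dd", "dd" to P(disease | genotype).
    """
    p, q = freq_d, 1.0 - freq_d
    prior = {"DD": p * p, "Dd": 2 * p * q, "dd": q * q}
    joint = {}
    for genotype, prior_prob in prior.items():
        f = penetrance[genotype]
        joint[genotype] = prior_prob * (f if affected else 1.0 - f)
    total = sum(joint.values())  # zero only if the status is impossible
    return {genotype: value / total for genotype, value in joint.items()}

dominant = {"DD": 1.0, "Dd": 1.0, "dd": 0.0}  # Mendelian dominant
leaky = {"DD": 0.8, "Dd": 0.8, "dd": 0.01}    # incomplete penetrance, f_dd > 0

print(genotype_posterior(dominant, 0.01))  # an affected person is never dd
print(genotype_posterior(leaky, 0.01))     # an affected person may well be dd
```

In the second model a sizeable share of affected people carry no disease allele at all, which is one reason linkage signals weaken for complex diseases.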

28 Finding the MLE $\hat{\theta}$
Often exhaustive search over $0 < \theta < \tfrac{1}{2}$, leaving all other parameters fixed. Some other parameters are independently estimable, e.g. founder frequencies from general population estimates. Sometimes more complex optimization is needed, particularly when including penetrance...
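An exhaustive search of this kind is a one-dimensional grid scan. The sketch below uses the simple recombinant-count likelihood as a stand-in for whatever pedigree likelihood is actually being maximized; the function names and the grid resolution are arbitrary choices made here.

```python
import math

def lod(theta, recombinants, total):
    """LOD at theta for R recombinants among P offspring (simple model)."""
    likelihood = theta ** recombinants * (1 - theta) ** (total - recombinants)
    return math.log10(likelihood / 0.5 ** total)

def grid_search_theta(recombinants, total, steps=500):
    """Scan the open interval (0, 0.5) and return (best theta, best LOD)."""
    grid = [i / (2 * steps) for i in range(1, steps)]
    best = max(grid, key=lambda theta: lod(theta, recombinants, total))
    return best, lod(best, recombinants, total)

theta_hat, best_lod = grid_search_theta(2, 20)
print(theta_hat, best_lod)  # close to R/P = 0.1, with a LOD a little above 3
```

For this toy likelihood the closed form R/P makes the scan unnecessary; the grid approach earns its keep when the likelihood comes from a pedigree calculation with no closed-form maximum. Keeping the whole curve rather than just the maximum also gives the LOD-versus-recombination-fraction plot.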

29 Whole genome scans
Pick 150-400 highly polymorphic markers. If using SNPs, need more markers due to low polymorphism (only 4 possible states) and therefore low informativeness. Do two point analyses for each marker. Look at regions with highest LOD scores (> 1) more finely, using more closely spaced markers in just those regions. Do NOT compare LOD scores of markers; compare LOD scores for regions.

30 Multipoint linkage analysis
Tries to establish order relationships among three (or more) loci. Typically three point: place a disease between two (ordered!) markers. Can use more than pairs of markers... More reliable than two point. Hypotheses are vectors of $\theta$'s: D-M1-M2 $(\theta_{D1}, \theta_{12})$ vs. M1-D-M2 $(\theta_{1D}, \theta_{D2})$ vs. M1-M2-D $(\theta_{12}, \theta_{2D})$. Calculate the MLE $\hat{\theta}$'s, then calculate the likelihood of each ordering. Accept the hypothesis with LOD score > 3 compared to the next best ordering.

31 Transcending model-based linkage
Linkage analysis based on a specific model of inheritance (e.g. Mendelian) works for simple inheritance patterns and large pedigrees. However, many important diseases do not fit this paradigm: traits with complex inheritance patterns, e.g. quantitative; bias toward finding Mendelian traits that have limited relevance to population scale incidence (e.g. BRCA1); non-random incomplete penetrance (e.g. any disease with a significant environmental component, say, cancer). Alternative approaches?

32 Non-parametric linkage analysis
No specific probability model of linkage. Work with pairs of affected siblings. Random sib pairs will share two alleles of a DNA marker 25% of the time, one allele 50% and no alleles 25%. If the marker is linked to the disease gene, then alleles will be shared between affected sib pairs more often. Easier to find and genotype large numbers of affected siblings than large pedigrees. Less statistical power than model-based methods...
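The 25% / 50% / 25% sharing figures for random sib pairs follow from counting the equally likely ways two children can each draw one allele from each parent. A small enumeration, with a hypothetical function name, confirms this:

```python
from itertools import product

def sib_pair_ibd_distribution():
    """Probability that two sibs share 0, 1 or 2 alleles identical by descent."""
    # Each sib gets paternal allele 0 or 1 and maternal allele 0 or 1.
    transmissions = list(product((0, 1), repeat=2))
    counts = {0: 0, 1: 0, 2: 0}
    for sib1, sib2 in product(transmissions, repeat=2):
        shared = (sib1[0] == sib2[0]) + (sib1[1] == sib2[1])
        counts[shared] += 1
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

print(sib_pair_ibd_distribution())  # {0: 0.25, 1: 0.5, 2: 0.25}
```

This is the null distribution; excess sharing among affected pairs at a marker is the signal a sib-pair method looks for.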

33 Identity by descent
Alleles can be shared among siblings for two reasons: (1) both inherited the allele from one parent (the same one who transmitted the disease); (2) the alleles are identical by chance: there are relatively few polymorphisms of the allele (e.g. 4) and there is a reasonable chance of random matches. Genotyping parents (triads) means inheritance (IBD or not) of the marker alleles can be studied directly.

34 Non-parametric linkage (cont'd)
Generally done on pairs of siblings, but can be other relatives (e.g. ¼ chance of sharing alleles between grandparent/grandchild). Need “both affected” and “only one affected” pairs. Estimate the number of IBD alleles for a marker. Correlate the number of shared (IBD) alleles of a marker with the affected state. High correlation means linkage to the disease.

35 More on NPL...
Originally designed for quantitative genotypes (e.g. length of RFLPs). For qualitative alleles, it's just counting whether the proportion of alleles of a possible marker shared IBD is greater in the “both affected” than in the “only one affected” condition. Model free, but not without assumptions: Mendelian segregation frequencies, no inbreeding, assumed population frequencies of alleles, etc.
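For qualitative alleles the counting described here reduces to comparing two proportions. The sketch below uses invented IBD counts and hypothetical function names, and it stops at the raw difference; a real analysis would add a significance test and would have to estimate IBD status rather than observe it.

```python
def ibd_sharing_proportion(ibd_counts):
    """Proportion of alleles shared IBD; each pair shares 0, 1 or 2 out of 2."""
    return sum(ibd_counts) / (2 * len(ibd_counts))

def sharing_excess(both_affected, one_affected):
    """Difference in IBD sharing between concordant and discordant pairs."""
    return ibd_sharing_proportion(both_affected) - ibd_sharing_proportion(one_affected)

# Made-up IBD counts at one marker, one number per sib pair.
both_affected = [2, 2, 1, 1, 2, 1]
one_affected = [0, 1, 1, 0, 1, 1]
print(sharing_excess(both_affected, one_affected))  # positive: more sharing when both are affected
```

A marker unlinked to the disease should give a difference near zero, since both groups then share about half their alleles on average.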

36 Still more on NPL...
Kruglyak and Lander showed how to use non-parametric approaches and multiple markers to do all the same tasks as the parametric approach: estimating recombination fraction, multipoint ordering of markers, etc. The GeneHunter program is widely used. Not as good as parametric studies at precisely placing the disease gene (flatter LOD curves). Hybrid models are called “Variance Components”.

37 Association studies
Look to see if presence/absence of a marker allele is correlated with a disease over a large population. Case/control study, no family analysis. Two ways this can work: the marker is in the disease gene haplotype, or linkage disequilibrium: larger regions of chromosome that tend to be inherited together due to recent evolutionary history. Only works for markers close to the disease gene.
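One common way to test such a correlation is a chi-square test on the 2x2 table of marker presence against case/control status. The slide does not name a specific test, so this is a stand-in chosen here, with invented counts and a hypothetical function name. The p-value uses the fact that a chi-square variable with one degree of freedom is a squared standard normal.

```python
import math

def association_chi_square(case_with, case_without, control_with, control_without):
    """Pearson chi-square (1 df) and p-value for a 2x2 marker-by-status table."""
    a, b, c, d = case_with, case_without, control_with, control_without
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every row and column needs at least one observation")
    chi2 = n * (a * d - b * c) ** 2 / denominator
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square with 1 df
    return chi2, p_value

# Made-up counts: 30 of 100 cases carry the allele versus 10 of 100 controls.
chi2, p_value = association_chi_square(30, 70, 10, 90)
print(chi2, p_value)  # chi-square of 12.5, p well below 0.001
```

When many markers are tested, a single-test threshold like 0.05 is far too lenient, so some multiple-testing correction would be needed in practice.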

38 Linkage disequilibrium
Controversial and important question: are some relatively large regions of the human genome inherited as a unit, or does recombination effectively scramble everything? Large studies indicate different amounts of LD in different populations: Europeans have a lot, Africans very little. Due to a population bottleneck in Europe, or some other cause?

39 Problems with association
Hard to use for whole genome studies, since markers have to be very close to genes. Compare linkage methods, where evenly spaced markers provide reasonable coverage. Mostly used in regions already narrowed by linkage studies. Can be high resolution. High throughput and cheap SNP genotyping may change this balance. Ultimate goal: haplotyping, where we assay trait polymorphisms directly.

40 Quantitative Trait Loci (QTLs)
Many traits of interest are quantitative (e.g. height, maximum heart rate, etc.). Quantitative traits are the subset of polygenic traits where the components have an additive effect. Hard to identify polygenic trait loci in humans: linkage studies are hard because each gene has only a small effect on phenotype; association studies have promise, but require huge genotyping efforts.

41 QTL analysis in the lab
If we can control breeding patterns, then it becomes more plausible to find QTLs. Crosses between inbred strains mean that contributions of parental alleles (and therefore recombination patterns) are known for all markers. Plant polyploidy also provides powerful tools. Inbred crosses have disadvantages, too: wild type alleles that contribute to the trait may not be present in inbred strains, and reduced marker polymorphism means lower resolution.

42 Association, SNPs and The Coalescent

43 Association studies
Look to see if presence/absence of a marker allele is correlated with a disease over a large population. Case/control study, no family analysis. Only works for markers very close to the disease gene itself. Two ways this can happen: the marker is in the disease gene (haplotype), or linkage disequilibrium: larger regions of chromosome that tend to be inherited together due to recent evolutionary history.

45 Thoughts on association
Expensive for whole genome studies, since markers have to be very close to genes. Compare linkage methods, where a few hundred evenly spaced markers provide reasonable coverage. Mostly used in regions already narrowed by linkage studies. Can be high resolution. High throughput and cheap SNP genotyping may change this balance. Ultimate goal: haplotyping, where we assay trait polymorphisms directly. “Tag” SNPs identify a haplotype uniquely, so we don’t assay all SNPs. Computational methods to find them…

46 Transmission disequilibrium
A model-based association test (TDT). Studies affecteds and their parents. Parents heterozygous for the marker transmit one marker allele and either the disease allele or not. Count the number of times each allele is transmitted (or not) to the affected subjects. The non-transmitted alleles serve as a control sample. Alleles that are passed with the disease more often than expected by chance are associated with it.
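The counting step has a standard test statistic, the McNemar-type chi-square usually associated with the TDT. The sketch below assumes the counts come only from heterozygous parents of affected children; the function name and the example counts are invented.

```python
import math

def tdt_statistic(transmitted, not_transmitted):
    """TDT chi-square (1 df) and p-value for one marker allele.

    transmitted / not_transmitted: how often heterozygous parents did or did
    not pass the allele to an affected child.
    """
    b, c = transmitted, not_transmitted
    if b + c == 0:
        raise ValueError("no informative (heterozygous) parents")
    chi2 = (b - c) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square with 1 df
    return chi2, p_value

# Made-up counts: the allele was transmitted 60 times and withheld 40 times.
print(tdt_statistic(60, 40))  # chi-square of 4.0, p about 0.046
```

Under no linkage or association each heterozygous parent passes the allele half the time, so the two counts should be roughly equal.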

47 Genetic Analysis Workshops
Like CASP for genetic epidemiology; the 15th biennial GAW is underway now. Competitive analysis of multiple related datasets, including real and simulated data. In this GAW: Dataset 1: expression arrays as phenotype, SNPs as genotype, large pedigree. Datasets 2 & 3: multiple data types (microsatellites, SNPs) and multiple study designs (case/control, sib pairs) on a complex disease (polygenic, incomplete penetrance); real and simulated data. Good web site:

50 Association informatics
The ability to genotype 500k markers in thousands of people (>1B genotypes!) puts new stresses on case/control calculations. One marker at a time is easiest, but misses linkage. Coalescent (ARG) provides good estimates of haplotype size, but is slow for large problems. Machine learning methods for marker combinations, e.g. HapMiner. Combining tree estimation and association in a single process, e.g. Blossoc.

51 Selecting tag SNPs
Idea is to select a subset of all SNPs for genotyping without losing (much) information. If there is a lot of linkage disequilibrium, then: 1. Define block boundaries. 2. Identify SNP combinations in the block that correlate highly with the others. If LD is not present throughout the genome, need a model of how SNPs are correlated. Coalescent simulations can provide estimates.
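As a rough illustration of the idea, the sketch below picks tags greedily using pairwise r² between SNPs. This is a simpler stand-in for the block-based procedure outlined on the slide, not an implementation of it; the 0.8 threshold, the function names and the tiny haplotype table are all assumptions made for the example.

```python
def r_squared(x, y):
    """Squared correlation between two biallelic SNPs coded 0/1 per haplotype."""
    n = len(x)
    px, py = sum(x) / n, sum(y) / n
    pxy = sum(a * b for a, b in zip(x, y)) / n
    denominator = px * (1 - px) * py * (1 - py)
    if denominator == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return (pxy - px * py) ** 2 / denominator

def greedy_tag_snps(haplotypes, threshold=0.8):
    """Pick SNP indices until every SNP has r^2 >= threshold with some tag."""
    n_snps = len(haplotypes[0])
    columns = [[h[j] for h in haplotypes] for j in range(n_snps)]
    covers = {
        j: {k for k in range(n_snps)
            if k == j or r_squared(columns[j], columns[k]) >= threshold}
        for j in range(n_snps)
    }
    untagged, tags = set(range(n_snps)), []
    while untagged:
        # Take the untagged SNP that captures the most still-untagged SNPs.
        best = max(sorted(untagged), key=lambda j: len(covers[j] & untagged))
        tags.append(best)
        untagged -= covers[best]
    return tags

# Four haplotypes, four SNPs: SNPs 0 and 1 always agree, SNPs 2 and 3 always differ.
haplotypes = [(0, 0, 0, 1), (0, 0, 1, 0), (1, 1, 0, 1), (1, 1, 1, 0)]
print(greedy_tag_snps(haplotypes))  # [0, 2]
```

Here two tags recover all four SNPs, because each untyped SNP is perfectly predicted by one of the typed ones.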

52 Linkage & Association Summary
Very active area now, since high throughput, low cost genotyping is on the horizon. Statistically complex area. Read the population genetics lecture notes first. (Use Holsinger if this is new to you.) Then revisit the Elston article, which is really deep. More closely tied to human health than most bioinformatics…

