The Complexities of Data Analysis in Human Genetics Marylyn DeRiggi Ritchie, Ph.D. Center for Human Genetics Research Vanderbilt University Nashville,


1 The Complexities of Data Analysis in Human Genetics Marylyn DeRiggi Ritchie, Ph.D. Center for Human Genetics Research Vanderbilt University Nashville, TN

2 Biology is complex [Figure credit: BioCarta]

3

4 Single nucleotide polymorphisms (SNPs)

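5 Mendelian Traits [Figure: pedigree with genotypes at Locus 1 (AA, Aa, aa) and Locus 2 (BB, Bb, bb), plus a 3 x 3 grid of two-locus genotypes with the affected genotypes marked]
AABB AABb AAbb
AaBB AaBb Aabb
aaBB aaBb aabb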

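6 Complex Traits [Figure: pedigree with genotypes at Locus 1 (AA, Aa, aa) and Locus 2 (BB, Bb, bb), plus a 3 x 3 grid of two-locus genotypes with the affected genotypes marked]
AABB AABb AAbb
AaBB AaBb Aabb
aaBB aaBb aabb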

7 Complex Traits
Complex trait implies the involvement of multiple genes and/or environmental factors
Mendelian trait implies a single mutation
Mendelian traits are generally rare
Complex traits are common and of substantial public health impact

8 Genetic Analysis
Two main areas of genetic analysis
1. Linkage analysis
2. Association analysis
Methods have been developed for each approach for a variety of different study designs

9 Association Analysis
In disease studies, when the disease gene is unknown, we look for association between genetic markers and the disease.
If a marker occurs more frequently or less frequently in affected individuals than in unaffected individuals, then it is associated with the disease.

10 Association Analysis
Case-control studies
  – Test for association between marker alleles and the disease phenotype in a group of affected and unaffected individuals sampled randomly from the population
Family-based studies
  – Test for association between marker alleles and the disease phenotype in a group of affected individuals and unaffected family members

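11 Case-control data structure
Status  SNP1  SNP2  SNP3  SNP4  SNP5  SNP6  SNP7  SNP8  SNP9  SNP10
1       1     2     2     1     2     1     2     2     1     2
1       0     0     0     1     0     0     0     0     1     0
1       0     2     0     1     1     0     2     0     1     1
1       2     0     1     1     0     2     0     1     1     0
1       2     1     1     0     0     2     1     1     0     0
1       1     0     0     0     0     1     0     0     0     0
1       1     1     0     1     2     1     1     0     1     2
1       1     0     1     0     2     1     0     1     0     2
1       0     0     0     2     0     0     0     0     2     0
1       0     0     1     0     1     0     0     1     0     1
0       2     1     0     1     0     2     1     0     1     0
0       0     1     1     0     0     0     1     1     0     0
0       1     1     0     2     1     1     1     0     2     1
0       0     0     2     0     1     0     0     2     0     1
0       2     1     0     1     1     2     1     0     1     1
0       0     0     2     0     0     0     0     2     0     0
0       1     0     0     1     2     1     0     0     1     2
0       0     1     1     1     2     0     1     1     1     2
0       1     1     0     0     2     1     1     0     0     2
0       0     1     2     0     0     0     1     2     0     0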

12 Association Analysis
Single marker tests
Haplotype association
Epistasis

13 Single marker tests [Diagram: SNP 1, SNP 2, and SNP 3, each tested on its own for association with Disease (???)]
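
A single marker test can be sketched as a Pearson chi-square test of independence between status and one SNP's genotype. The example below is an illustration, not the presenter's method: it uses the SNP1 column from the case-control data structure slide and assumes that Status 1 means case, Status 0 means control, and genotypes are coded 0, 1, 2. The function name `genotype_chi_square` is hypothetical. With only 20 individuals the chi-square approximation is rough, so the p-value is only indicative.

```python
import math
from collections import Counter

def genotype_chi_square(status, genotypes):
    """Pearson chi-square test of independence between status and genotype.

    Assumes status is coded 1 = case, 0 = control and genotype is coded 0/1/2.
    """
    n = len(status)
    observed = Counter(zip(status, genotypes))  # counts per (status, genotype) cell
    row_totals = Counter(status)
    col_totals = Counter(genotypes)
    chi2 = 0.0
    for s in row_totals:
        for g in col_totals:
            expected = row_totals[s] * col_totals[g] / n
            chi2 += (observed[(s, g)] - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    # For exactly 2 degrees of freedom the chi-square tail probability
    # has the closed form exp(-x / 2); other cases are left unevaluated here.
    p_value = math.exp(-chi2 / 2) if df == 2 else None
    return chi2, df, p_value

# Status and SNP1 columns from the case-control data structure slide.
status = [1] * 10 + [0] * 10
snp1 = [1, 0, 0, 2, 2, 1, 1, 1, 0, 0,   # cases
        2, 0, 1, 0, 2, 0, 1, 0, 1, 0]   # controls

chi2, df, p_value = genotype_chi_square(status, snp1)
print(f"chi-square = {chi2:.3f}, df = {df}, p = {p_value:.3f}")
```

Repeating this for every SNP gives one test per marker, which is why single marker tests cannot see effects that only appear in combinations of SNPs.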

14 Haplotype

15 Haplotype Analysis
May be able to increase power by testing for association with marker haplotype
Haplotype is a block of DNA that stays intact through generations
Do not directly observe marker haplotypes
Use likelihood methods to infer
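
One common likelihood method for inferring unobserved haplotypes is the expectation-maximization (EM) algorithm. The sketch below is a minimal illustration for two biallelic SNPs, not necessarily the approach the presenter had in mind. It assumes Hardy-Weinberg proportions and genotypes coded as counts (0, 1, 2) of one allele at each SNP. The function names and the toy genotypes are hypothetical.

```python
# Haplotypes are (allele at SNP 1, allele at SNP 2), each allele coded 0 or 1.
HAPLOTYPES = [(0, 0), (0, 1), (1, 0), (1, 1)]

def compatible_pairs(g1, g2):
    """Unordered haplotype pairs consistent with an unphased two-SNP genotype."""
    pairs = []
    for i, h in enumerate(HAPLOTYPES):
        for k in HAPLOTYPES[i:]:
            if h[0] + k[0] == g1 and h[1] + k[1] == g2:
                pairs.append((h, k))
    return pairs

def em_haplotype_frequencies(genotypes, iterations=50):
    """Estimate haplotype frequencies from unphased genotypes by EM."""
    freq = {h: 0.25 for h in HAPLOTYPES}  # uniform starting values
    for _ in range(iterations):
        counts = {h: 0.0 for h in HAPLOTYPES}
        for g1, g2 in genotypes:
            pairs = compatible_pairs(g1, g2)
            # E-step: weight each compatible pair by its probability under
            # Hardy-Weinberg (heterozygous pairs can arise in two orders).
            weights = [freq[h] * freq[k] * (1 if h == k else 2) for h, k in pairs]
            total = sum(weights)
            for (h, k), w in zip(pairs, weights):
                counts[h] += w / total
                counts[k] += w / total
        # M-step: new frequencies are expected counts over 2 haplotypes per person.
        n_haplotypes = 2 * len(genotypes)
        freq = {h: c / n_haplotypes for h, c in counts.items()}
    return freq

# Hypothetical toy data: only the double heterozygotes (1, 1) are ambiguous.
toy_genotypes = [(0, 0)] * 4 + [(2, 2)] * 4 + [(1, 1)] * 2
estimated = em_haplotype_frequencies(toy_genotypes)
for haplotype, f in estimated.items():
    print(haplotype, round(f, 4))
```

In this toy data the unambiguous individuals carry only the (0, 0) and (1, 1) haplotypes, so EM resolves the double heterozygotes toward that same pair.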

16 Haplotype Analysis

17 Epistasis: Gene-Gene Interactions
The term “epistasis” was first used by William Bateson (1909)
Literal translation is “standing upon” (i.e. one gene masks the effects of another gene)
[Table: phenotype by Genotype at Locus A (rows) and Genotype at Locus B (columns BB, Bb, bb); rows as extracted: AA: White, Grey; Aa: Black, Grey; Aa: Black, Grey]
W. Bateson, Mendel’s Principles of Heredity (1909)
A.R. Templeton, In: Wade et al. (eds), Epistasis and the Evolutionary Process (2000)
Cordell, Human Molecular Genetics 11:2463-8 (2002)

18 Gene-gene Interactions
Searching for gene-gene interactions brings about a whole new suite of problems and challenges
Types of interactions
  – Additive
  – Multiplicative
  – Epistatic
Curse of dimensionality: big problem

19 Curse of Dimensionality [Figure: samples divided among the three genotypes of SNP 1 (AA, Aa, aa)] N = 100 (50 Cases, 50 Controls)

20 Curse of Dimensionality [Figure: 3 x 3 grid of SNP 1 genotypes (AA, Aa, aa) by SNP 2 genotypes (BB, Bb, bb)] N = 100 (50 Cases, 50 Controls)

21 Curse of Dimensionality [Figure: genotype grid extended to four SNPs: SNP 1 (AA, Aa, aa), SNP 2 (BB, Bb, bb), SNP 3 (CC, Cc, cc), SNP 4 (DD, Dd, dd)] N = 100 (50 Cases, 50 Controls)
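
The arithmetic behind these three slides can be sketched in a few lines. Each SNP has three genotypes, so k SNPs define 3^k multilocus genotype cells, and the same N = 100 individuals are spread over ever more cells. The helper name `genotype_cells` is hypothetical, and the per-cell figure is a simple average that ignores genotype frequencies.

```python
N = 100  # 50 cases and 50 controls, as on the slides

def genotype_cells(n_snps, genotypes_per_snp=3):
    """Number of multilocus genotype combinations for n_snps SNPs."""
    return genotypes_per_snp ** n_snps

# (number of SNPs, number of cells, average samples per cell)
summary = [(k, genotype_cells(k), N / genotype_cells(k)) for k in range(1, 5)]

for k, cells, per_cell in summary:
    print(f"{k} SNP(s): {cells:>2} cells, about {per_cell:.1f} samples per cell")
```

With four SNPs there are 81 cells for 100 individuals, so many cells will hold one sample or none, and estimates within a cell become unreliable.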

22 Three Other Issues to Consider
1. Variable selection
2. Model selection
3. Interpretation

23 1. Variable Selection
How can you determine which variables to select?
Not computationally feasible to evaluate all possible combinations
Need to select correct variables to detect interactions

24 How many combinations are there?
~500,000 SNPs span 80% of common variation in genome (HapMap)
SNPs in each subset    Number of Possible Combinations
1                      5 x 10^5
2                      1 x 10^11
3                      2 x 10^16
4                      3 x 10^21
5                      2 x 10^26

25 How many combinations are there?
~500,000 SNPs span 80% of common variation in genome (HapMap)
SNPs in each subset    Number of Possible Combinations
1                      5 x 10^5
2                      1 x 10^11
3                      2 x 10^16
4                      3 x 10^21
5                      2 x 10^26
At 1 combination per second and 86,400 seconds per day, evaluating the roughly 2 x 10^26 five-SNP combinations is reported to take 2.979536 x 10^21 days (8.163113 x 10^18 years)
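
The counts on these two slides are binomial coefficients, and the run time follows by dividing by the evaluation rate. The sketch below recomputes both under the slide's assumptions (500,000 SNPs, 1 combination per second) plus an assumed 365-day year. The function names are hypothetical. The slide values are rounded, so the printed figures will not match them digit for digit.

```python
import math

N_SNPS = 500_000
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365  # assumption: a plain 365-day year

def subsets_of_size(k, n=N_SNPS):
    """Number of distinct k-SNP subsets that can be drawn from n SNPs."""
    return math.comb(n, k)

def years_to_evaluate(n_combinations, per_second=1.0):
    """Years needed to evaluate every combination at the given rate."""
    days = n_combinations / (per_second * SECONDS_PER_DAY)
    return days / DAYS_PER_YEAR

for k in range(1, 6):
    print(f"{k} SNP(s) per subset: {subsets_of_size(k):.1e} combinations")

five_way = subsets_of_size(5)
print(f"five-SNP subsets at 1 per second: {years_to_evaluate(five_way):.1e} years")
```

Raising `per_second` shows that even a millionfold faster evaluation leaves an exhaustive five-SNP search far out of reach, which is the motivation for variable selection.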

26 2. Model Selection
For each variable subset, evaluate a statistical model
Goal is to identify the best subset of variables that compose the best model

27 Finding the best model
Choose variable subset -> Choose statistical model -> Evaluate model fitness -> Best model
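
The loop on this slide can be sketched as an exhaustive search: choose a subset, score it, and keep the best. In the example below the "statistical model" is simply the combined genotype of the chosen SNPs, and the fitness is the mutual information between that combined genotype and status. Both choices are assumptions made for illustration, and the function names and toy data are hypothetical. Fitness is measured on the same data used for the search, so a real analysis would add cross-validation or permutation testing.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(status, combined):
    """Mutual information in bits between status and a combined genotype."""
    n = len(status)
    joint = Counter(zip(status, combined))
    status_counts = Counter(status)
    genotype_counts = Counter(combined)
    mi = 0.0
    for (s, g), c in joint.items():
        mi += (c / n) * math.log2(c * n / (status_counts[s] * genotype_counts[g]))
    return mi

def best_subset(status, snps, size):
    """Score every SNP subset of the given size and return (subset, fitness)."""
    best = (None, -1.0)
    for subset in combinations(sorted(snps), size):
        # Combined genotype: one tuple of genotypes per individual.
        combined = list(zip(*(snps[name] for name in subset)))
        fitness = mutual_information(status, combined)
        if fitness > best[1]:
            best = (subset, fitness)
    return best

# Hypothetical toy data: status depends on SNP1 and SNP2 jointly, on neither alone.
toy_snps = {
    "SNP1": [0, 0, 1, 1, 0, 0, 1, 1],
    "SNP2": [0, 1, 0, 1, 0, 1, 0, 1],
    "SNP3": [0, 0, 0, 0, 1, 1, 1, 1],
}
toy_status = [0, 1, 1, 0, 0, 1, 1, 0]

single = best_subset(toy_status, toy_snps, 1)
pair = best_subset(toy_status, toy_snps, 2)
print("best single SNP:", single)
print("best SNP pair:", pair)
```

In the toy data no single SNP carries any information about status, yet the SNP1 and SNP2 pair predicts it completely, which is the kind of interaction that single marker tests miss.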

28 Simple Fitness Landscape [Plot; axis labels: Model, Fitness]

29 Complex Fitness Landscape [Plot; axis labels: Model, Fitness]

30 3. Interpretation
Selection of best statistical model in a vast search space of possible models
Statistical or computational model may not translate into biology
May not be able to identify prevention or treatment strategies directly
Wet lab experiments will be necessary, but may not be sufficient

31 3. Interpretation
Strategies to assess biological interpretation of gene-gene interaction models
1. Consider current knowledge about the biochemistry of the system and the biological plausibility of the models
2. Perform experiments in the wet lab to measure the effect of small perturbations to the system
3. Computer simulation algorithms to model biochemical systems

32 Additional Challenges (true of all association studies)
Sample size and power/type I error
Population specific effects
  – Age, gender
Poorly matched cases and controls
  – Ethnic background
  – Controls must be “at risk”
Bias
Heterogeneity

33
Phenotypic (Clinical, Trait)
  – Affected individuals vary in clinical expression
Genetic
  – Different inheritance patterns for same disease
Locus
  – Different genes lead to the same disease
Allelic
  – Different alleles at the same gene lead to same/different disease
Thornton-Wells TA, Moore JH, Haines JL. Trends in Genetics, 2004;20(12):640-7.

34 New Statistical Approaches
Data Reduction
  – Combinatorial Partitioning Method (CPM)
  – Multifactor Dimensionality Reduction (MDR)
  – Detection of informative combined effects (DICE)
  – Logic Regression
  – Set Association Analysis
Pattern Recognition
  – Symbolic Discriminant Analysis (SDA)
  – Cellular Automata (CA)
  – Neural Networks (NN)
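
To give a feel for the data reduction idea, the sketch below shows the core pooling step of Multifactor Dimensionality Reduction in a deliberately simplified form: each two-locus genotype cell is labelled high risk or low risk according to its ratio of cases to controls, which collapses the 3 x 3 grid into a single two-level variable. This is an illustration only. Published MDR also includes cross-validation and model comparison across SNP combinations, which are omitted here. The function names, the threshold default, and the toy data are assumptions.

```python
from collections import Counter

def mdr_reduce(status, snp_a, snp_b, threshold=1.0):
    """Label each two-locus genotype cell high risk (1) or low risk (0).

    Assumes status is coded 1 = case, 0 = control.
    """
    cases = Counter()
    controls = Counter()
    for s, a, b in zip(status, snp_a, snp_b):
        (cases if s == 1 else controls)[(a, b)] += 1
    risk = {}
    for cell in set(cases) | set(controls):
        # High risk when cases / controls >= threshold; written as a product
        # so that cells with zero controls need no special handling.
        risk[cell] = 1 if cases[cell] >= threshold * controls[cell] else 0
    return risk

def classification_accuracy(status, snp_a, snp_b, risk):
    """Fraction of individuals whose cell label matches their status."""
    matches = sum(1 for s, a, b in zip(status, snp_a, snp_b) if risk[(a, b)] == s)
    return matches / len(status)

# Hypothetical toy data in which status depends on both SNPs jointly.
toy_status = [0, 1, 1, 0, 0, 1, 1, 0]
toy_snp_a = [0, 0, 1, 1, 0, 0, 1, 1]
toy_snp_b = [0, 1, 0, 1, 0, 1, 0, 1]

risk_labels = mdr_reduce(toy_status, toy_snp_a, toy_snp_b)
accuracy = classification_accuracy(toy_status, toy_snp_a, toy_snp_b, risk_labels)
print(risk_labels)
print(f"training accuracy: {accuracy:.2f}")
```

The accuracy here is measured on the data used to build the labels, so it overstates how well the model would predict new individuals.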

35 Areas of Future Work (possible collaborations)
More analytical methods for gene-gene and gene-environment interactions
  – Especially including categorical and continuous variables simultaneously
Inclusion of pathway information into analyses
Ways of dealing with heterogeneity of all kinds

