The Complexities of Data Analysis in Human Genetics
Marylyn DeRiggi Ritchie, Ph.D.
Center for Human Genetics Research, Vanderbilt University, Nashville, TN

Biology is complex
[Figure; source: BioCarta]

Single nucleotide polymorphisms (SNPs)

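Mendelian Traits
[Figure: two-locus genotype grid. Locus 1 genotypes (AA, Aa, aa) crossed with Locus 2 genotypes (BB, Bb, bb), giving the nine combined genotypes AABB, AABb, AAbb, AaBB, AaBb, Aabb, aaBB, aaBb, aabb; the affected genotypes were marked on the slide.]

Complex Traits
[Figure: the same two-locus genotype grid, Locus 1 (AA, Aa, aa) by Locus 2 (BB, Bb, bb), with the affected genotypes marked on the slide.]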

Complex Traits
Complex trait implies the involvement of multiple genes and/or environmental factors
Mendelian trait implies a single mutation
Mendelian traits are generally rare
Complex traits are common and of substantial public health impact

Genetic Analysis
Two main areas of genetic analysis:
1. Linkage analysis
2. Association analysis
Methods have been developed for each approach for a variety of different study designs

Association Analysis
In disease studies, when the disease gene is unknown, we look for association between genetic markers and the disease.
If a marker occurs more frequently or less frequently in affected individuals than in unaffected individuals, then it is associated with the disease.

Association Analysis
Case-control studies
– Test for association between marker alleles and the disease phenotype in a group of affected and unaffected individuals drawn randomly from the population
Family-based studies
– Test for association between marker alleles and the disease phenotype in a group of affected individuals and unaffected family members
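A case-control association test for a single marker can be sketched as a Pearson chi-square test on a 2x2 table of allele counts. This is only one common choice of test, not necessarily the one used in the talk; the function name and the allele counts below are hypothetical and chosen purely for illustration.

```python
import math


def allelic_chi_square(case_counts, control_counts):
    """Pearson chi-square test on a 2x2 table of allele counts.

    case_counts and control_counts are (allele_A, allele_a) tuples.
    Returns (chi2, p_value) for 1 degree of freedom.
    """
    a, b = case_counts
    c, d = control_counts
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must have a nonzero total")
    # Shortcut formula for the 2x2 Pearson statistic (no continuity correction).
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical allele counts: (A, a) in cases and in controls.
chi2, p = allelic_chi_square((60, 40), (40, 60))
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```

A small p-value here says only that the allele frequencies differ between the two groups; it does not by itself address the matching, bias, and heterogeneity issues raised later in the talk.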

Case-control data structure
[Table columns: Status, SNP1, SNP2, SNP3, SNP4, SNP5, SNP6, SNP7, SNP8, SNP9, SNP (final header truncated; cell values not preserved)]

Association Analysis
Single marker tests
Haplotype association
Epistasis

Single marker tests
[Diagram: SNP 1, SNP 2, SNP 3 → Disease ???]

Haplotype

Haplotype Analysis
May be able to increase power by testing for association with marker haplotypes
A haplotype is a block of DNA that stays intact through generations
Marker haplotypes are not directly observed
Use likelihood methods to infer them
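As an illustration of likelihood-based haplotype inference, the sketch below uses an expectation-maximization (EM) procedure for two biallelic SNPs. EM is one standard likelihood method and is assumed here; the slide does not say which method was meant. Genotypes are coded as the number of copies of the second allele (a or b) at each SNP, and all names and data are hypothetical.

```python
def em_haplotype_frequencies(genotypes, n_iter=100):
    """Estimate frequencies of haplotypes AB, Ab, aB, ab by EM.

    genotypes: list of (g1, g2), where g1 counts allele 'a' at SNP 1
    and g2 counts allele 'b' at SNP 2 (each 0, 1, or 2).
    """
    if not genotypes:
        raise ValueError("need at least one genotype")
    haps = ["AB", "Ab", "aB", "ab"]
    fixed = dict.fromkeys(haps, 0.0)
    n_double_het = 0
    for g1, g2 in genotypes:
        if g1 == 1 and g2 == 1:
            # Phase is ambiguous: AB/ab or Ab/aB.
            n_double_het += 1
            continue
        # At most one locus is heterozygous, so the phase is known.
        alleles1 = "A" * (2 - g1) + "a" * g1
        alleles2 = "B" * (2 - g2) + "b" * g2
        for x, y in zip(alleles1, alleles2):
            fixed[x + y] += 1
    total = 2 * len(genotypes)
    freq = dict.fromkeys(haps, 0.25)
    for _ in range(n_iter):
        # E-step: probability that a double heterozygote is AB/ab.
        cis = freq["AB"] * freq["ab"]
        trans = freq["Ab"] * freq["aB"]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: re-estimate frequencies from expected haplotype counts.
        counts = dict(fixed)
        counts["AB"] += n_double_het * w
        counts["ab"] += n_double_het * w
        counts["Ab"] += n_double_het * (1 - w)
        counts["aB"] += n_double_het * (1 - w)
        freq = {h: counts[h] / total for h in haps}
    return freq


# Hypothetical sample: four AABB, four aabb, two double heterozygotes.
sample = [(0, 0)] * 4 + [(2, 2)] * 4 + [(1, 1)] * 2
print(em_haplotype_frequencies(sample))
```

With more than two SNPs the number of possible haplotypes grows quickly, which is one reason dedicated software is normally used in practice.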

Haplotype Analysis

Epistasis: Gene-Gene Interactions
Epistasis first used by William Bateson (1909)
Literal translation is “standing upon” (i.e., one gene masks the effects of another gene).
[Table: phenotype by Genotype at Locus A (rows AA, Aa, Aa) and Genotype at Locus B (columns BB, Bb, bb). Surviving cell values: AA: White, Grey; Aa: Black, Grey; Aa: Black, Grey. Source: Cordell, Human Molecular Genetics 11: (2002)]
W. Bateson, Mendel’s Principles of Heredity (1909)
A.R. Templeton, In: Wade et al. (eds), Epistasis and the Evolutionary Process (2000)

Gene-gene Interactions
Searching for gene-gene interactions brings about a whole new suite of problems and challenges
Types of interactions
– Additive
– Multiplicative
– Epistatic
Curse of dimensionality – big problem

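Curse of Dimensionality
[Figure: one SNP. SNP 1 genotypes AA, Aa, aa. Sample-size label reads "N = Cases, 50 Controls", with the leading figures lost.]

Curse of Dimensionality
[Figure: two SNPs. SNP 1 genotypes (AA, Aa, aa) crossed with SNP 2 genotypes (BB, Bb, bb). Same sample-size label.]

Curse of Dimensionality
[Figure: four SNPs. The SNP 1 by SNP 2 grid (AA, Aa, aa by BB, Bb, bb) is repeated within each combination of SNP 3 (CC, Cc, cc) and SNP 4 (DD, Dd, dd) genotypes. Same sample-size label.]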
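The growth these figures depict can be sketched numerically: with three genotypes per SNP, the number of multilocus genotype cells is 3 to the power of the number of SNPs, so a fixed sample is spread ever more thinly. The sample size of 100 below is an assumed illustrative value, since the figure on the slide is not fully preserved, and the function names are hypothetical.

```python
def genotype_cells(n_snps, genotypes_per_snp=3):
    """Number of multilocus genotype combinations for n_snps biallelic SNPs."""
    return genotypes_per_snp ** n_snps


def mean_per_cell(n_samples, n_snps):
    """Average number of individuals per genotype cell."""
    return n_samples / genotype_cells(n_snps)


# n_samples = 100 is an assumption made for illustration only.
for k in range(1, 5):
    print(k, genotype_cells(k), round(mean_per_cell(100, k), 2))
```

Once the average falls near or below one individual per cell, many cells are empty and cell-wise estimates of risk become unreliable.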

Three Other Issues to Consider
1. Variable selection
2. Model selection
3. Interpretation

1. Variable Selection
How can you determine which variables to select?
Not computationally feasible to evaluate all possible combinations
Need to select correct variables to detect interactions

How many combinations are there?
~500,000 SNPs span 80% of common variation in genome (HapMap)
[Table: SNPs in each subset vs. Number of Possible Combinations; the numeric entries are not preserved in this transcript.]
[Calculation: the number of combinations, evaluated at 1 combination per second and converted using the number of seconds per day, gives the days (and years) needed to complete the search; the numeric values are not preserved in this transcript.]
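The counts behind this slide can be recomputed from the binomial coefficient, which gives the number of distinct subsets of k SNPs drawn from n. The sketch below assumes subset sizes of 1 to 5 and uses the slide's rate of 1 combination per second; the function names are hypothetical.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365.25


def n_subsets(n_snps, k):
    """Number of distinct k-SNP subsets among n_snps SNPs."""
    return math.comb(n_snps, k)


def days_to_evaluate(n_combinations, per_second=1.0):
    """Days needed to evaluate every combination at the given rate."""
    return n_combinations / per_second / SECONDS_PER_DAY


for k in range(1, 6):
    combos = n_subsets(500_000, k)
    days = days_to_evaluate(combos)
    print(f"k={k}: {combos:.2e} combinations, "
          f"{days:.2e} days ({days / DAYS_PER_YEAR:.2e} years)")
```

Faster evaluation shifts these totals by a constant factor only, while each additional SNP in the subset multiplies the count by several orders of magnitude.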

2. Model Selection
For each variable subset, evaluate a statistical model
Goal is to identify the best subset of variables that compose the best model

Finding the best model
Choose variable subset → Choose statistical model → Evaluate model fitness → Best model
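The loop on this slide can be sketched as an exhaustive search: enumerate variable subsets, score a model on each, and keep the best. The scoring function below is a toy stand-in (an assumption, not a real statistical model), and `best_model` and `toy_score` are hypothetical names.

```python
from itertools import combinations


def best_model(variables, subset_size, fit_and_score):
    """Return (subset, score) with the highest score over all subsets.

    fit_and_score is any callable that fits a model on a subset of
    variables and returns a fitness value (higher is better).
    """
    best_subset, best_score = None, float("-inf")
    for subset in combinations(variables, subset_size):
        score = fit_and_score(subset)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score


# Toy fitness: counts how many "causal" SNPs (an assumed set) are included.
CAUSAL = {"SNP2", "SNP4"}


def toy_score(subset):
    return len(CAUSAL.intersection(subset))


print(best_model(["SNP1", "SNP2", "SNP3", "SNP4"], 2, toy_score))
```

Exhaustive enumeration like this is only workable for small numbers of variables, which is why the variable selection problem matters.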

Simple Fitness Landscape
[Figure: fitness plotted against model]

Complex Fitness Landscape
[Figure: fitness plotted against model]

3. Interpretation
Selection of best statistical model in a vast search space of possible models
Statistical or computational model may not translate into biology
May not be able to identify prevention or treatment strategies directly
Wet lab experiments will be necessary, but may not be sufficient

3. Interpretation
Strategies to assess biological interpretation of gene-gene interaction models
1. Consider current knowledge about the biochemistry of the system and the biological plausibility of the models
2. Perform experiments in the wet lab to measure the effect of small perturbations to the system
3. Use computer simulation algorithms to model biochemical systems

Additional Challenges (true of all association studies)
Sample size and power/type I error
Population specific effects
– Age, gender
Poorly matched cases and controls
– Ethnic background
– Controls must be “at risk”
Bias
Heterogeneity

Phenotypic (Clinical, Trait)
– Affected individuals vary in clinical expression
Genetic
– Different inheritance patterns for same disease
Locus
– Different genes lead to the same disease
Allelic
– Different alleles at the same gene lead to same/different disease
Thornton-Wells TA, Moore JH, Haines JL. Trends in Genetics, 2004;20(12):

New Statistical Approaches
Data Reduction
– Combinatorial Partitioning Method (CPM)
– Multifactor Dimensionality Reduction (MDR)
– Detection of informative combined effects (DICE)
– Logic Regression
– Set Association Analysis
Pattern Recognition
– Symbolic Discriminant Analysis (SDA)
– Cellular Automata (CA)
– Neural Networks (NN)

Areas of Future Work (possible collaborations)
More analytical methods for gene-gene and gene-environment interactions
– Especially including categorical and continuous variables simultaneously
Inclusion of pathway information into analyses
Ways of dealing with heterogeneity of all kinds