Design and Validation of Methods Searching for Risk Factors in Genotype Case-Control Studies. Dumitru Brinza, Alexander Zelikovsky, Department of Computer Science.


Design and Validation of Methods Searching for Risk Factors in Genotype Case-Control Studies
Dumitru Brinza, Alexander Zelikovsky
Department of Computer Science, Georgia State University
SNPHAP 2007, January 27, 2007

Outline
- SNPs, Haplotypes and Genotypes
- Heritable Common Complex Diseases
- Disease Association Search in Case-Control Studies
- Addressing Challenges in DA
- Risk Factor Validation for Reproducibility
- Atomic Risk Factors / Multi-SNP Combinations
- Maximum Odds Ratio Atomic RF
- Approximate vs Exhaustive Searches
- Datasets / Results
- Conclusions / Related & Future Work

SNP, Haplotypes, Genotypes
- Human genome: all the genetic material in the chromosomes, about 3×10^9 base pairs long.
- Any two people differ in about 0.1% of the genome.
- SNP (single nucleotide polymorphism): a site where two or more different nucleotides occur in a large percentage of the population.
- Diploid: two different copies of each chromosome.
- Haplotype: description of a single copy (expensive); 0 stands for the major allele, 1 for the minor allele.
- Genotype: description of the two copies mixed together, encoded per SNP as 0 = 00, 1 = 11, 2 = 01.
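The genotype encoding above can be illustrated with a minimal sketch. It assumes haplotypes are given as strings over '0' and '1'; the function name `genotype_from_haplotypes` is hypothetical and not taken from the slides.

```python
def genotype_from_haplotypes(h1, h2):
    """Mix two haplotypes into one genotype string.

    Encoding from the slide: 0 = 00 (both major), 1 = 11 (both minor),
    2 = 01 (heterozygous). String inputs are an assumption of this sketch.
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNPs")
    genotype = []
    for a, b in zip(h1, h2):
        if a not in ("0", "1") or b not in ("0", "1"):
            raise ValueError("haplotype alleles must be '0' or '1'")
        # Equal alleles keep their value; different alleles become 2.
        genotype.append(a if a == b else "2")
    return "".join(genotype)


print(genotype_from_haplotypes("0110", "0101"))  # prints 0122
```

The mapping is not invertible: a genotype with several 2s is consistent with more than one pair of haplotypes, which is why haplotypes have to be measured or inferred separately.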

Heritable Common Complex Diseases
- Complex disease
  - Interaction of multiple genes
    - One mutation does not cause the disease
    - Breakage of all compensatory pathways causes the disease
    - Hard to analyze: a 2-gene interaction analysis for a genome-wide scan with 1 million SNPs requires about 5×10^11 pairwise tests (10^6 choose 2)
  - Multiple independent causes
    - There are different causes, and each of them can result from the interaction of several genes
    - Each cause explains a certain percentage of cases
- Common diseases (> 0.1% of the population) are complex. In New York City, 12% of the population has type 2 diabetes.

DA Search in Case/Control Study
- Given: a population of n genotypes, each containing the values of m SNPs and a disease status
- Find: risk factors (RF) with a significantly high odds ratio, i.e., a pattern/dihaplotype significantly more frequent among cases than among controls
[Figure: case genotypes and control genotypes over the SNPs, with disease status]

Challenges in Disease Association
- Computational
  - Interaction of multiple genes/SNPs: too many possibilities, obviously intractable
  - Multiple independent causes: each RF may explain only a small portion of the case-control study
- Statistical/Reproducing
  - Search space / number of possible RFs: adjust for multiple testing
  - Searching engine complexity: adjust for multiple methods / search complexity

Addressing Challenges in DA
- Computational
  - Constrain the model / reduce the search space: the negative effect is that "true" RFs may be missed
  - Heuristic search: look for "easy to find" RFs; may miss only "maliciously hidden" true RFs
- Statistical/Reproducing
  - Validate on a different case-control study: obvious but expensive
  - Cross-validate in the same study: the usual method for prediction validation

Significance of Risk Factors
- Relative risk (RR): cohort study
- Odds ratio (OR): case-control study
- P-value: binomial distribution
- Searching for risk factors among many SNPs requires a multiple-testing adjustment of the p-value
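The two quantities named on this slide can be sketched as follows. The odds ratio is the standard 2×2-table definition. The binomial p-value is an assumption about what the slide means: the probability that at least the observed number of carriers are cases when each carrier is a case with probability equal to the overall case fraction. Function names are hypothetical.

```python
from math import comb


def odds_ratio(cases_with, cases_total, controls_with, controls_total):
    """Odds ratio of a risk factor from a 2x2 case-control table."""
    cases_without = cases_total - cases_with
    controls_without = controls_total - controls_with
    numerator = cases_with * controls_without
    denominator = cases_without * controls_with
    if denominator == 0:
        # No controls carry the RF (or every case does): OR is unbounded,
        # or undefined when the numerator is zero as well.
        return float("inf") if numerator > 0 else float("nan")
    return numerator / denominator


def binomial_p_value(cases_with, controls_with, cases_total, controls_total):
    """Upper-tail binomial probability (an assumed null model, see lead-in)."""
    carriers = cases_with + controls_with
    p_case = cases_total / (cases_total + controls_total)
    return sum(
        comb(carriers, i) * p_case**i * (1 - p_case) ** (carriers - i)
        for i in range(cases_with, carriers + 1)
    )


# Hypothetical counts: RF present in 30 of 100 cases and 10 of 100 controls.
print(odds_ratio(30, 100, 10, 100))
print(binomial_p_value(30, 10, 100, 100))
```

This p-value is unadjusted; it still needs the multiple-testing adjustment mentioned in the last bullet.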

Reproducibility Control
- Multiple-testing adjustment
  - Bonferroni: easy to compute, overly conservative
  - Randomization: computationally expensive, more accurate
- Validation rate using cross-validation
  - Leave-One-Out
  - Leave-Many-Out
  - Leave-Half-Out
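The two adjustment strategies can be contrasted in a short sketch. The Bonferroni function is the standard correction. The randomization function is one common way to do it (permuting disease labels and rerunning the search), which is an assumption here since the slide does not spell out the procedure; `best_score` is a hypothetical callback that runs the search and returns the best statistic it found.

```python
import random


def bonferroni(p_value, num_tests):
    """Bonferroni-adjusted p-value: multiply by the number of tests, cap at 1."""
    return min(1.0, p_value * num_tests)


def randomization_p_value(genotypes, is_case, best_score, trials=1000, seed=0):
    """Adjusted p-value by label permutation (assumed procedure).

    best_score(genotypes, labels) must rerun the whole search and return
    the best statistic found, so the search effort itself is accounted for.
    """
    rng = random.Random(seed)
    observed = best_score(genotypes, is_case)
    labels = list(is_case)
    hits = 0
    for _ in range(trials):
        rng.shuffle(labels)  # break any genotype/disease association
        if best_score(genotypes, labels) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (trials + 1)


print(bonferroni(0.001, 100))
```

The cost difference is visible in the code: Bonferroni is a single multiplication, while randomization reruns the full search once per trial.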

Atomic Risk Factors, MSCs and Clusters
- Genotype SNP = Boolean function over 2 haplotype SNPs x and y
  - 0 iff $g_0$ = (x NOR y) is TRUE
  - 1 iff $g_1$ = (x AND y) is TRUE
  - 2 iff $g_2$ = (x XOR y) is TRUE
- Single-SNP risk factor = Boolean formula over $g_0$, $g_1$ and $g_2$
- Complex risk factor (RF) = CNF over single-SNP RFs, e.g. $g_0^{1}\,(g_0 + g_2)^{2}\,(g_1 + g_2)^{3}\,g_0^{5}$ (superscripts index the SNPs)
- Atomic risk factor (ARF) = unsplittable complex RF, e.g. $g_0^{1}\,g_2^{2}\,g_1^{3}\,g_0^{5}$: a single disease-associated factor
- ARF ↔ multi-SNP combination (MSC); MSC = subset of SNPs with fixed values 0, 1, or 2
- Cluster = subset of genotypes with the same MSC
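As an illustration of the MSC and cluster definitions, here is a minimal sketch. It assumes genotypes are strings over '0', '1', '2' and represents an MSC as a dict from 0-based SNP index to required value; both the representation and the function names are hypothetical.

```python
def matches(genotype, msc):
    """True if the genotype has the fixed value at every SNP of the MSC."""
    return all(genotype[snp] == value for snp, value in msc.items())


def cluster(genotypes, msc):
    """Indices of the genotypes that share the given MSC."""
    return [i for i, g in enumerate(genotypes) if matches(g, msc)]


# An MSC fixing SNP 1 = 0, SNP 2 = 2, SNP 3 = 1, SNP 5 = 0
# (1-based numbering as on the slide, 0-based indices in the dict).
msc = {0: "0", 1: "2", 2: "1", 4: "0"}
genotypes = ["02100", "02120", "02101", "12100"]
print(cluster(genotypes, msc))  # prints [0, 1]
```

SNP 4 is left free by this MSC, so the first two genotypes fall into the same cluster even though they differ there.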

MORARF formulation
Maximum Odds Ratio Atomic Risk Factor
- Given: a genotype case-control study
- Find: the ARF with the maximum odds ratio
- Clusters with fewer controls have a higher OR => MORARF includes finding the maximum control-free cluster
- MORARF contains the maximum independent set problem => no provably good search for a general case-control study
- Case-control studies do not bother to hide true RFs => even simple heuristics may work

Requirements to Approximate Search
- Fast: a longer search needs more adjustment
- Non-trivial: exhaustive search is slow
- Simple: Occam's razor

Exhaustive Searching Approaches
- Exhaustive search (ES)
  - For n genotypes with m SNPs there are O(n km ) k-SNP MSCs
- Exhaustive Combinatorial Search (CS)
  - Drop small (insignificant) clusters
  - Search only plausible/maximal MSCs
- Case-closure of an MSC: the MSC extended with the SNP values common to all of its cases; it gives the minimum cluster with the same set of cases
[Figure: case-closure example on five genotypes (control, case, case, case, control). The original MSC is present in 2 cases and 2 controls; its case-closure fixes one more SNP value and is present in 2 cases and 1 control.]
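The case-closure operation can be sketched as below. The data is hypothetical and only chosen to reproduce the counts in the figure (2 cases : 2 controls before closure, 2 cases : 1 control after). Genotypes are assumed to be strings over '0', '1', '2' and an MSC a dict from 0-based SNP index to value; the function names are not from the slides.

```python
def matches(genotype, msc):
    """True if the genotype has the fixed value at every SNP of the MSC."""
    return all(genotype[snp] == value for snp, value in msc.items())


def case_closure(genotypes, is_case, msc):
    """Extend the MSC with every SNP value shared by all cases in its cluster."""
    cases = [g for g, c in zip(genotypes, is_case) if c and matches(g, msc)]
    closed = dict(msc)
    if not cases:
        return closed  # no cases in the cluster: nothing to extend
    for snp in range(len(cases[0])):
        values = {g[snp] for g in cases}
        if len(values) == 1:  # all cases agree on this SNP
            closed[snp] = cases[0][snp]
    return closed


def counts(genotypes, is_case, msc):
    """(number of cases, number of controls) in the cluster of the MSC."""
    hits = [c for g, c in zip(genotypes, is_case) if matches(g, msc)]
    return sum(hits), len(hits) - sum(hits)


genotypes = ["120", "120", "120", "011", "121"]
is_case = [False, True, True, True, False]
msc = {0: "1", 1: "2"}
closed = case_closure(genotypes, is_case, msc)
print(counts(genotypes, is_case, msc))     # prints (2, 2)
print(closed)                              # prints {0: '1', 1: '2', 2: '0'}
print(counts(genotypes, is_case, closed))  # prints (2, 1)
```

Closure never loses a case, because only values shared by all cases in the cluster are added, while it can only remove controls. That is why restricting the search to case-closed MSCs does not miss better clusters.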

Combinatorial Search
Combinatorial Search Method (CS):
- Searches only among case-closed MSCs
- Avoids checking clusters with a small number of cases
- Finds significant MSCs faster than ES
- Still too slow for large data
- Further speedup by reducing the number of SNPs

Complimentary Greedy Search (CGS)
- Intuition:
  - Max OR when no controls – chosen cases do not have simila
  - Max independent set by removing highest-degree vertices
- Fixing an SNP value
  - Removes controls -> profit
  - Removes cases -> expense
  - Maximize profit/expense!
- Algorithm:
  - Starting with an empty MSC, add the SNP value that removes from the current cluster the maximum number of controls per removed case
- Extremely fast but inaccurate; gets trapped in a local maximum
[Figure: cases and controls]
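The greedy step described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: genotypes are strings over '0', '1', '2'; the profit/expense ratio is computed as controls removed divided by (cases removed + 1), where the +1 is an assumed guard against division by zero; and a candidate is skipped if it would leave no cases or remove no controls. The function name is hypothetical.

```python
def greedy_msc_search(genotypes, is_case):
    """Greedily fix SNP values until the cluster contains no controls."""
    msc = {}
    cluster = list(range(len(genotypes)))
    num_snps = len(genotypes[0])
    while any(not is_case[i] for i in cluster):
        cases_now = sum(1 for i in cluster if is_case[i])
        controls_now = len(cluster) - cases_now
        best = None  # (ratio, snp, value, surviving cluster)
        for snp in range(num_snps):
            if snp in msc:
                continue
            for value in "012":
                kept = [i for i in cluster if genotypes[i][snp] == value]
                cases_kept = sum(1 for i in kept if is_case[i])
                controls_removed = controls_now - (len(kept) - cases_kept)
                if cases_kept == 0 or controls_removed == 0:
                    continue  # would lose every case, or gains nothing
                cases_removed = cases_now - cases_kept
                # Profit per expense; +1 is an assumption of this sketch.
                ratio = controls_removed / (cases_removed + 1)
                if best is None or ratio > best[0]:
                    best = (ratio, snp, value, kept)
        if best is None:
            break  # no SNP value can remove more controls
        _, snp, value, cluster = best
        msc[snp] = value
    return msc, cluster


genotypes = ["02", "02", "10", "00", "12", "01"]
is_case = [True, True, True, False, False, False]
print(greedy_msc_search(genotypes, is_case))  # prints ({1: '2', 0: '0'}, [0, 1])
```

Each iteration either removes at least one control or stops, so the loop terminates. Because choices are never revisited, the result depends on the first locally best SNP value, which is the local-maximum weakness noted on the slide.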

Disease Association Search
- AcS: alternating combinatorial search method
- RCGS: randomized complimentary greedy search method

5 Data Sets
- Crohn's disease (Daly et al.): inflammatory bowel disease (IBD)
  Location: 5q31; SNPs: 103; population: 387 (144 cases, 243 controls)
- Autoimmune disorders (Ueda et al.)
  Location: region containing genes CD28, CTLA4 and ICOS; SNPs: 108; population: 1024 (378 cases, 646 controls)
- Tick-borne encephalitis (Barkash et al.)
  Location: region containing genes TLR3, PKR, OAS1, OAS2, and OAS3; SNPs: 41; population: 75 (21 cases, 54 controls)
- Lung cancer (Dragani et al.)
  SNPs: 141; population: 500 (260 cases, 240 controls)
- Rheumatoid arthritis (GAW15)
  SNPs: 2300; population: 920 (460 cases, 460 controls)

Search Results

Validation Results

Conclusions
- Approximate search methods find more significant RFs
- RFs found by approximate searches have a higher cross-validation rate
  - Significant MSCs are better cross-validated
- Significant MSCs with many SNPs (>10) can be efficiently found and confirmed
- RCGS (randomized method) is better than CGS (deterministic method)

Related & Future Work
- More randomized methods
  - Simulated Annealing / Gibbs Sampler / HMM
  - But they are slower
- Indexing (have our MLR tagging)
  - Find MSCs in samples reduced to index/tag SNPs
  - May have more power (?)
- Disease Susceptibility Prediction
  - Use found RFs for prediction rather than prediction for RF search