Integrating domain knowledge with statistical and data mining methods for high-density genomic SNP disease association analysis Dinu et al, J. Biomedical Informatics 40 (2007) 750-760

Pathway/SNP: A software application that lets its user apply pathway data in the analysis of high-density genomic SNP data derived from disease association studies.
- The purpose is to analyze the underlying etiology of disease by integrating pathway information with statistical and data mining approaches.

Background: Large-scale genome-wide association (GWA) studies are now available to identify genomic mutations associated with a wide range of diseases. Complex diseases, such as diabetes and hypertension, are believed to be caused by the interaction of multiple genes and environmental factors. The number of mathematical operations required to assess the association between multiple interacting genomic loci and disease grows exponentially with the number of interacting SNPs. Various statistical approaches (stepwise algorithms, varying parameters, etc.) are used to analyze these associations. Data mining approaches are used for multi-locus association with traits.

Computational complexity: A brute-force ‘full-scan’ interaction analysis between all possible combinations of n genomic markers and a disease is exponential in n.
For the Affymetrix 100K SNP GeneChip, m = 100,000 genomic markers. A full scan requires:

# of interacting markers    # of tests
2                           5.00 x 10^9
3                           1.66 x 10^14
4                           4.16 x 10^18
5                           8.33 x 10^22

The fastest supercomputer can perform ~3.67 x 10^14 floating-point operations per second (flops).
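As a rough check of these counts, the number of k-marker combinations among m markers is the binomial coefficient C(m, k). The sketch below recomputes it for m = 100,000. The function name and the "one floating-point operation per test" figure used for the time estimate are illustrative assumptions, not part of the presentation.

```python
from math import comb

M = 100_000      # markers on the Affymetrix 100K GeneChip (from the slide)
FLOPS = 3.67e14  # supercomputer speed quoted on the slide


def full_scan_tests(m: int, k: int) -> int:
    """Number of distinct k-marker combinations among m markers."""
    return comb(m, k)


for k in range(2, 6):
    tests = full_scan_tests(M, k)
    # Assumption: each test costs a single floating-point operation,
    # so this is only a lower bound on the running time.
    seconds = tests / FLOPS
    print(f"{k}-way interactions: {tests:.2e} tests, at least {seconds:.2e} s")
```

Even under this optimistic assumption, the 5-way scan would take on the order of 10^8 seconds, which is why restricting the search to pathway SNPs is attractive.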

Conclusion: A “one model fits all” approach is not optimal.
Pathway/SNP: Designed as an exploratory tool that integrates pathway information, gene annotation, and SNP location to identify the pathways that are most strongly associated with disease.
Architecture: 3-tier architecture written in Java
1> Presentation tier: written in Java Server Pages
2> Logic tier: statistical and data mining algorithms in Java
3> Data tier: genotype, phenotype and annotation data stored in a heavily indexed relational database

Biological Data:
- Annotations for 561 pathways: 181 KEGG, 314 BioCarta and 66 GenMAPP human pathways.
- Gene annotation data from NCBI Entrez Gene.
- Affymetrix 100k and 500k GeneChip microarray annotation files are preloaded in the database.
Relevant SNPs: SNPs located within 10,000 base pairs (bp) of the location of a gene in a given biological pathway are considered relevant to that pathway.
Relevant Genes: A first gene list is extracted from a particular database and then augmented from the literature and Entrez Gene.
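The 10,000 bp rule for relevant SNPs can be sketched as a simple interval check. The `Gene` and `Snp` record types, the representation of a gene location as a start/end interval, and the coordinates in the example are all hypothetical; the presentation only states the distance threshold.

```python
from dataclasses import dataclass

WINDOW_BP = 10_000  # distance threshold stated on the slide


@dataclass(frozen=True)
class Gene:
    symbol: str
    chrom: str
    start: int  # assumption: gene location is a start/end interval
    end: int


@dataclass(frozen=True)
class Snp:
    snp_id: str
    chrom: str
    pos: int


def relevant_snps(snps, pathway_genes, window=WINDOW_BP):
    """Return the SNPs lying within `window` bp of any pathway gene."""
    hits = []
    for snp in snps:
        for gene in pathway_genes:
            same_chrom = snp.chrom == gene.chrom
            if same_chrom and gene.start - window <= snp.pos <= gene.end + window:
                hits.append(snp)
                break  # one matching gene is enough
    return hits


# Made-up coordinates, for illustration only.
genes = [Gene("GENE1", "1", 500_000, 520_000)]
snps = [Snp("snp_a", "1", 495_000), Snp("snp_b", "1", 540_000), Snp("snp_c", "2", 510_000)]
print([s.snp_id for s in relevant_snps(snps, genes)])
```

In the real tool this lookup runs against the indexed relational database rather than in-memory lists.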

Algorithms:
1> Single SNP association with disease: Chi square and Armitage’s trend test
2> Pathway association with disease: U-statistics or data mining algorithms
3> Permutation-based statistical significance inference: Bonferroni adjustment or False Discovery Rate (FDR)

Single SNP association with disease:
1> Chi square test
2> Armitage’s trend test

Chi square test, allele-based: 1 degree of freedom

            Allele A count    Allele B count
Case
Control

More preferred

Chi square test, genotype-based: 2 degrees of freedom

            AA count    AB count    BB count
Case
Control
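A minimal sketch of the two chi square variants follows, using the standard Pearson statistic. The function names and the genotype counts in the example are made up for illustration; the closed-form p-values used here hold only for 1 and 2 degrees of freedom.

```python
import math


def pearson_chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return stat, df


def chi_square_p_value(stat, df):
    """Upper-tail p-value, using closed forms for df = 1 and df = 2."""
    if df == 1:
        return math.erfc(math.sqrt(stat / 2.0))
    if df == 2:
        return math.exp(-stat / 2.0)
    raise ValueError("this sketch only handles df = 1 or df = 2")


def allele_table(genotype_table):
    """Collapse (AA, AB, BB) counts per group into (A, B) allele counts."""
    return [[2 * aa + ab, 2 * bb + ab] for aa, ab, bb in genotype_table]


# Rows are case and control; columns are AA, AB, BB (made-up counts).
genotypes = [[10, 20, 30], [30, 20, 10]]
for name, table in (("genotype-based", genotypes), ("allele-based", allele_table(genotypes))):
    stat, df = pearson_chi_square(table)
    print(name, round(stat, 2), df, chi_square_p_value(stat, df))
```

The sketch assumes no row or column total is zero; a SNP with an empty genotype column would need that column dropped first.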

Armitage’s Trend Test: This test checks whether case vs. control status shows a ‘trend’ across genotypes, under different models of association between a SNP and disease.
Additive model: tests an association that depends additively on the number of risk (minor) alleles: 0 for homozygous non-risk, 1 for heterozygous and 2 for homozygous risk.
Dominant model: tests the association of having at least one risk allele, either homozygous risk (1) or heterozygous (1), vs. homozygous non-risk (0).
Recessive model: tests the association of being homozygous for the risk allele (1) vs. having at least one non-risk allele, either homozygous non-risk (0) or heterozygous (0).
The Armitage’s Trend Test statistic has 1 degree of freedom.
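The three genotype codings can be plugged into the usual Cochran-Armitage trend statistic as weights. The sketch below uses the common N-based form of the statistic; the function name, the `WEIGHTS` table and the example counts are illustrative assumptions, and the tool itself may compute the statistic differently.

```python
# Genotype weights in the order (homozygous non-risk, heterozygous, homozygous risk),
# following the 0/1/2 coding described for each model.
WEIGHTS = {
    "additive": (0, 1, 2),
    "dominant": (0, 1, 1),
    "recessive": (0, 0, 1),
}


def armitage_trend(cases, controls, model="additive"):
    """Cochran-Armitage trend chi-square (1 degree of freedom).

    `cases` and `controls` are genotype counts in the same order as the weights.
    """
    t = WEIGHTS[model]
    col_totals = [r + s for r, s in zip(cases, controls)]
    n_total = sum(col_totals)
    n_cases = sum(cases)
    sum_tr = sum(ti * ri for ti, ri in zip(t, cases))
    sum_tn = sum(ti * ni for ti, ni in zip(t, col_totals))
    sum_t2n = sum(ti * ti * ni for ti, ni in zip(t, col_totals))
    numerator = n_total * (n_total * sum_tr - n_cases * sum_tn) ** 2
    denominator = n_cases * (n_total - n_cases) * (n_total * sum_t2n - sum_tn ** 2)
    return numerator / denominator


# Made-up counts: risk genotypes are enriched among cases.
cases, controls = (10, 20, 30), (30, 20, 10)
for model in WEIGHTS:
    print(model, armitage_trend(cases, controls, model))
```

With the dominant or recessive weights the statistic reduces to the Pearson chi square of the corresponding collapsed 2 x 2 table, which is a convenient sanity check.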

U-statistics for pathway association with disease: A non-parametric algorithm that can simultaneously test the association of multiple markers with disease using only a single degree of freedom. It first computes a score over all markers (the set of SNPs) for each pair of subjects within the case group and within the control group. The genetic score for a pair of subjects is given by a “kernel” function, such as recessive, dominant or linear dosage. It then compares the average scores of cases and controls using a global statistic with one degree of freedom, instead of the implicit many degrees of freedom that arise when many markers are analyzed. The resulting z-scores can be used to rank pathways and to calculate an approximate p-value.
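The scoring-and-comparison idea can be sketched as follows. This is a deliberately simplified illustration, not the full U-statistic: it omits the variance estimate needed for the z-score, and the `allele_match` kernel shown (number of alleles two genotypes share) is an assumed definition, since the presentation names the kernels without defining them. Genotypes are coded as the count of risk alleles (0, 1 or 2).

```python
from itertools import combinations


def allele_match(g1, g2):
    """Assumed kernel: number of alleles shared by two genotypes (0, 1 or 2)."""
    return 2 - abs(g1 - g2)


def mean_pair_score(subjects, kernel=allele_match):
    """Average kernel score over all subject pairs, summed over markers."""
    pairs = list(combinations(subjects, 2))
    total = sum(kernel(a, b) for s1, s2 in pairs for a, b in zip(s1, s2))
    return total / len(pairs)


def score_difference(cases, controls, kernel=allele_match):
    """Difference in average pairwise score between cases and controls."""
    return mean_pair_score(cases, kernel) - mean_pair_score(controls, kernel)


# Each subject is a tuple of genotypes at the pathway SNPs (made-up data).
cases = [(2, 2), (2, 2), (2, 1)]
controls = [(0, 2), (2, 0), (1, 1)]
print(score_difference(cases, controls))
```

A large positive difference means cases resemble each other genetically more than controls do across the pathway SNPs; turning it into a z-score requires the variance of the statistic, which this sketch does not attempt.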

Consider b as risk allele and a as non-risk allele

Data mining for pathway association with disease: Data mining classifiers (e.g., SVM, Random Forests, logistic regression, tree-based) can be used to explore the association between pathways and disease. The “percent correct” classification of cases and controls, estimated from the genotypes at the pathway SNPs, can be used as a statistic for measuring the association between pathways and disease. The classifiers are incorporated through the Weka data mining program and are run by default with 10-fold cross-validation.
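To make the “percent correct” statistic concrete, the sketch below runs a 10-fold cross-validation in plain Python. The nearest-centroid classifier is only a stand-in chosen to keep the example self-contained; it is not one of the Weka classifiers, and the unstratified fold assignment is likewise an assumption.

```python
import random


def nearest_centroid_predict(train_x, train_y, x):
    """Stand-in classifier: pick the class whose mean genotype vector is closest."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, y in zip(train_x, train_y) if y == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def percent_correct(x, y, folds=10, seed=0):
    """Cross-validated percentage of subjects classified correctly."""
    order = list(range(len(y)))
    random.Random(seed).shuffle(order)
    correct = 0
    for f in range(folds):
        test_idx = set(order[f::folds])  # every `folds`-th subject is held out
        train_x = [x[i] for i in order if i not in test_idx]
        train_y = [y[i] for i in order if i not in test_idx]
        for i in test_idx:
            if nearest_centroid_predict(train_x, train_y, x[i]) == y[i]:
                correct += 1
    return 100.0 * correct / len(y)


# Made-up genotypes at two pathway SNPs; 1 = case, 0 = control.
x = [(0, 0)] * 10 + [(2, 2)] * 10
y = [0] * 10 + [1] * 10
print(percent_correct(x, y))
```

A pathway whose SNPs carry no signal should score near the proportion of the larger class, so the statistic is best judged against a permutation baseline rather than against 50%.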

Multiple testing corrections: A good test statistic value may have occurred by chance alone. Multiple testing corrections are designed to help guard against this.
Bonferroni adjustment: The Bonferroni adjustment multiplies each individual p-value by the number of times the same test was performed (the number of markers tested). The adjusted value, which is quite conservative, estimates the probability that the test would have come out this well by chance at least once across all the times it was performed.
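The Bonferroni adjustment amounts to one line of arithmetic. In the sketch below the function name is illustrative, and capping the result at 1 is the usual convention rather than something stated in the presentation.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Three made-up p-values from three markers.
print(bonferroni_adjust([0.01, 0.04, 0.5]))
```

With 100,000 markers, a raw p-value must be below 5 x 10^-7 to stay under 0.05 after adjustment, which illustrates how conservative the correction is.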

Statistical significance using permutation-based FDR: The False Discovery Rate (FDR) option calculates the False Discovery Rate for each statistical test selected. The FDR is itself computed from the p-values of the original tests. Its interpretation is: “What would the rate of false discoveries (false positives) be if I accepted ALL of the tests whose p-value is at or below the p-value of this test?” The aim of the FDR procedure is to control, at a desired level α (e.g., 0.05), the proportion of type I errors (false positives) among all significant results.

- Suppose m hypotheses are tested and R of them are rejected (positive results). Of the rejected hypotheses, suppose that V are really null; that is, V is the number of type I errors, or false positive results. The False Discovery Rate is defined as FDR = E(V/R | R > 0) · P(R > 0), that is, the expected proportion of false positive findings among all rejected hypotheses, times the probability of making at least one rejection.
- This procedure may yield higher statistical power than controlling the family-wise error rate. Pathways with a low FDR (e.g., below 0.05) are considered significant.
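One widely used way to control the FDR as defined here is the Benjamini-Hochberg step-up procedure, sketched below. It is named plainly as a substitute: it works directly on p-values and is not necessarily the permutation-based estimate that Pathway/SNP implements. The function name is illustrative.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up pathway p-values; pathways with adjusted value below 0.05 would be kept.
print(benjamini_hochberg([0.001, 0.5, 0.02, 0.03]))
```

Each adjusted value answers the question quoted above: it estimates the false discovery rate incurred by accepting every test whose p-value is at or below that of the given test.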

Using Pathway/SNP to analyze an age-related macular degeneration (AMD) data set:
- The data set contains 116,204 genome-wide SNPs genotyped with the Affymetrix 100k GeneChip.
- Case-control study of 146 Caucasian individuals: 50 controls and 96 cases with advanced AMD (50 patients with wet AMD, the severe form, and 46 patients with dry AMD).
- Initial analysis identified a mutation in complement factor H (CFH) on chromosome 1 as strongly associated with AMD.
- 46 genes were identified (from KEGG and NCBI genome version 35); a total of 94 SNPs are relevant (within 10,000 bp).
- Armitage’s trend test with the additive model, U-statistics with 5 kernels (dominant, recessive, linear, quadratic, allele match) and 4 data mining algorithms (J48, Random Forests, SVM, Naïve Bayes) were performed.
- Subjects were compared in 4 groupings: control vs. all cases (wet + dry), control vs. wet AMD, control vs. dry AMD, and dry AMD vs. wet AMD.

Identified two additional pathway genes, C7 and MBL2:

Explanation of the difference between progressing to dry AMD, the less severe form, and progressing to wet AMD, the more severe one

Lessons learned:
- The potential need for high performance computation to support a tool like Pathway/SNP
- The need for permutation testing to evaluate the results of the analysis
- Dealing with different versions of the biological data and knowledge
- Why different analysis algorithms might work better with different data sets and different diseases
- The complexity of the “clinical phenotype”