Modeling Gene Interactions in Disease CS 686 Bioinformatics.


1 Modeling Gene Interactions in Disease CS 686 Bioinformatics

2 Some Definitions
Data mining: extracting hidden patterns and useful information from large data sets. Examples: clustering, machine learning. It should not be: "Torturing data until it confesses... and if you torture it enough, it will confess to anything" (Jeff Jonas, IBM).
Machine learning: the ability of a program to learn from experience. Examples: neural networks, decision trees, rule-based methods, multifactor dimensionality reduction (MDR).

3 Methods
Regression methods: model the relationship between a dependent variable and one or more independent variables.
Data mining methods: search the space of possible models efficiently. They cope better with non-linear and high-dimensional data, or with data that has many potential interactions.
Exhaustive search: search all possible models for the best one.

4 Linear regression
Models the outcome as a linear combination of the parameters (but not necessarily of the independent variables).
Example: let $y$ = incidence of disease, with $n$ data points and independent variables $A$ and $B$.
1) $y_i = b_0 + b_1 A_i + \varepsilon_i$, $i = 1, \dots, n$
2) $y_i = b_0 + b_2 B_i^2 + \varepsilon_i$, $i = 1, \dots, n$
where $b_0$, $b_1$, $b_2$ are parameters and $\varepsilon_i$ is the error term. In both examples the disease is modeled as linear in the parameters, although model 2 is quadratic in the variable $B$.

5 Linear regression
Given a sample, we estimate the parameters (for example, by least squares) to arrive at the fitted linear regression model [1].
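The least-squares estimate mentioned above can be sketched in a few lines. This is a minimal illustration, not the course's own code: the function name `fit_simple_least_squares` and the toy data are assumptions made for the example. It fits the quadratic-in-$B$ model from the previous slide by regressing on the transformed variable $B^2$, which works because the model is still linear in its parameters.

```python
from statistics import mean


def fit_simple_least_squares(x, y):
    """Return (intercept, slope) minimizing sum((y_i - b0 - b1 * x_i) ** 2)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical, noise-free toy data generated from y = 1 + 2 * B^2.
B = [0.0, 1.0, 2.0, 3.0]
y = [1.0 + 2.0 * b ** 2 for b in B]

# The model is quadratic in B but linear in (b0, b2),
# so ordinary least squares applies after transforming B to B^2.
B_squared = [b ** 2 for b in B]
b0, b2 = fit_simple_least_squares(B_squared, y)
print(f"b0 = {b0:.3f}, b2 = {b2:.3f}")
```

With real data the error term would not vanish, and the estimates would only approximate the true parameters.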

6 Multiple regression
Relates the outcome to a linear combination of several predictor variables.
Example: let $y$ = incidence of disease, with $n$ data points and independent variables $x_1, x_2, \dots, x_p$.
$y_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + \dots + b_p x_{ip} + \varepsilon_i$, $i = 1, \dots, n$
Best-fit line: for each unit increase in $x_{ip}$, $y_i$ is expected to increase by $b_p$, holding the other predictors fixed.

7 Logistic regression [1]
Often used when the outcome is binary; relates the log-odds of the probability of an event to a linear combination of predictor variables.
Example: $\ln\left(\frac{p}{1 - p}\right) = \alpha + \beta x_B + \gamma x_C + i\, x_B x_C$, where $x_B$ and $x_C$ are measured binary indicator variables, the regression coefficients $\beta$ and $\gamma$ represent main effects, and $i$ represents the interaction.
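To make the role of the interaction term concrete, the sketch below evaluates the logistic model for all four combinations of the binary indicators. The coefficient values are hypothetical, chosen only for illustration; nothing here is estimated from data.

```python
import math


def disease_probability(x_b, x_c, alpha, beta, gamma, i):
    """Invert the log-odds model ln(p / (1 - p)) = alpha + beta*x_b + gamma*x_c + i*x_b*x_c."""
    log_odds = alpha + beta * x_b + gamma * x_c + i * x_b * x_c
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical coefficients (assumed for this example, not fitted values).
coeffs = {"alpha": -2.0, "beta": 0.5, "gamma": 0.5, "i": 1.5}

probs = {}
for x_b in (0, 1):
    for x_c in (0, 1):
        probs[(x_b, x_c)] = disease_probability(x_b, x_c, **coeffs)
        print(f"x_B={x_b}, x_C={x_c}: p = {probs[(x_b, x_c)]:.3f}")
```

The interaction coefficient only contributes when both indicators equal 1, so it captures the extra change in log-odds beyond the sum of the two main effects.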

8 Other statistical methods [1]
Bayesian model selection: a statistical approach incorporating both prior distributions for parameters and observed data into the model.
Maximum likelihood: a statistical method used to make inferences about the combination of parameter values resulting in the highest probability of obtaining the observed data.
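As a small illustration of maximum likelihood, the sketch below assumes a hypothetical sample of 100 individuals of whom 30 are affected, and searches a grid of candidate disease probabilities for the one under which the observed data are most probable. The counts and the grid search are assumptions made for the example; in practice the maximum is usually found analytically or with a numerical optimizer.

```python
import math


def bernoulli_log_likelihood(p, cases, n):
    """Log-probability of observing `cases` affected out of `n`, up to a constant."""
    return cases * math.log(p) + (n - cases) * math.log(1 - p)


# Hypothetical observed data: 30 affected individuals out of 100.
cases, n = 30, 100

# Candidate values 0.01, 0.02, ..., 0.99 for the disease probability.
candidates = [k / 100 for k in range(1, 100)]
best_p = max(candidates, key=lambda p: bernoulli_log_likelihood(p, cases, n))
print(f"maximum-likelihood estimate on the grid: {best_p}")
```

The grid maximum coincides with the observed proportion of cases, which is the known closed-form maximum-likelihood estimate for this simple model.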

9 Modeling Terminology [1]
Saturated: a statistical model that is as full as possible (saturated) with parameters.
Marginal effects: the effects of one parameter averaged over the possible values taken by other parameters.
Entropy: the uncertainty associated with a random variable.
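The slide defines entropy only informally. One common way to quantify it is Shannon entropy measured in bits, which is the assumption made in the sketch below; the function name and the genotype sample are illustrative.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical genotype sample with frequencies 0.25, 0.50, 0.25.
genotypes = ["AA", "Aa", "Aa", "aa"]
print(shannon_entropy(genotypes))
```

A variable that always takes the same value has zero entropy, and entropy grows as the values become more evenly spread.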

10 Modeling Terminology [1]
Cross-validation: partitioning a data set into n subsets, then using each subset in turn as the test set while using the other n-1 to train.
Overfitting: a model that provides a good fit to a specific data set but generalizes poorly.
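The cross-validation procedure described above can be sketched as an index generator. This is a minimal illustration: the function name `k_fold_splits` is an assumption, and the parameter `k` plays the role of the slide's number of subsets. Real analyses typically shuffle the data before partitioning.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    # Assign samples to folds round-robin: fold f gets indices f, f + k, f + 2k, ...
    folds = [indices[f::k] for f in range(k)]
    for test_fold in range(k):
        test = folds[test_fold]
        train = [j for f, fold in enumerate(folds) if f != test_fold for j in fold]
        yield train, test


splits = list(k_fold_splits(n_samples=10, k=5))
for train, test in splits:
    print("train:", train, "test:", test)
```

Each sample appears in exactly one test set, so every observation is used for evaluation once and for training in the remaining folds. Comparing training and test performance across folds is one way to spot overfitting.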

11 Marginal Effects [2]
Marginal penetrance example: the probability $P(D \mid A = Aa)$, irrespective of the value of B.

Table II. Penetrance values for combinations of genotypes from two single nucleotide polymorphisms exhibiting interactions in the absence of independent main effects.

| Genotype | AA (0.25) | Aa (0.50) | aa (0.25) | Marginal penetrance B |
|---|---|---|---|---|
| BB (0.25) | 0 | 1 | 0 | 0.5 |
| Bb (0.50) | 1 | 0 | 1 | 0.5 |
| bb (0.25) | 0 | 1 | 0 | 0.5 |
| Marginal penetrance A | 0.5 | 0.5 | 0.5 | |

Genotype frequencies are given in parentheses. The last row and last column give the marginal penetrance values for the A and B genotypes.
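The marginal penetrance values in Table II can be reproduced by averaging each row or column over the genotype frequencies of the other locus. The sketch below assumes the genotypes at A and B are independent, so the frequencies in parentheses can be used directly as weights; the variable and function names are illustrative.

```python
# Genotype frequencies from Table II.
freq_a = {"AA": 0.25, "Aa": 0.50, "aa": 0.25}
freq_b = {"BB": 0.25, "Bb": 0.50, "bb": 0.25}

# Penetrance P(D | A, B) from Table II, keyed by (A genotype, B genotype).
penetrance = {
    ("AA", "BB"): 0, ("Aa", "BB"): 1, ("aa", "BB"): 0,
    ("AA", "Bb"): 1, ("Aa", "Bb"): 0, ("aa", "Bb"): 1,
    ("AA", "bb"): 0, ("Aa", "bb"): 1, ("aa", "bb"): 0,
}


def marginal_penetrance_a(a):
    """P(D | A = a), averaging over B (assumes A and B are independent)."""
    return sum(freq_b[b] * penetrance[(a, b)] for b in freq_b)


def marginal_penetrance_b(b):
    """P(D | B = b), averaging over A (assumes A and B are independent)."""
    return sum(freq_a[a] * penetrance[(a, b)] for a in freq_a)


marginals_a = {a: marginal_penetrance_a(a) for a in freq_a}
marginals_b = {b: marginal_penetrance_b(b) for b in freq_b}
print(marginals_a)
print(marginals_b)
```

Every marginal value is the same even though the joint penetrance depends strongly on the genotype combination, which is why single-locus tests would miss this interaction.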

12 Weka [3]
A collection of visualization tools and algorithms for data analysis and predictive modeling.
Preprocessing: tools for reading data in a variety of formats and transforming it.
Classification: algorithms include regression, neural network, support vector machine, decision tree. Display includes ROC curves.
Clustering: k-means, expectation maximization.
Visualization: includes scatter plot, bar graph.

13 References
[1] Cordell, 2009. Detecting gene–gene interactions that underlie human diseases. Nature Reviews Genetics.
[2] McKinney et al., 2006. Machine Learning for Detecting Gene-Gene Interactions: A Review. Biomedical Genomics and Proteomics.
[3] Weka site: http://www.cs.waikato.ac.nz/ml/weka

