Predicting Genetic Regulatory Response Using Classification Us v. Them (“Them” being Manuel Middendorf, Anshul Kundaje, Chris Wiggins, Yoav Freund, and.

Predicting Genetic Regulatory Response Using Classification Us v. Them (“Them” being Manuel Middendorf, Anshul Kundaje, Chris Wiggins, Yoav Freund, and Christina Leslie, in “Predicting Genetic Regulatory Response Using Classification” (2004)

The Problem Current studies of gene transcription tend to be descriptive Need for a predictive system – the ability to predict gene regulation for new experiments Rather than determining patterns in sets of genes and conditions, look at underyling causes of those patterns

The Important Parts of Genes & Experiments Regulation is determined by binding sites (motifs) and regulators (parents) The significance of experiments, then, is how they affect regulators The significance of genes is what motifs they contain

Binding Sites and Regulation Discretize gene response into only up-regulated (1), down-regulated(-1), or unchanged (0) A motif is either present (1) or absent (0) A parent is either up-regulated (1), down- regulated (-1), or unchanged (0) Assume (and we need to check with someone who actually knows something about biology on this) that things only happen if motif is present and parent is either up- or down-regulated

What our matrix really looks like g = # of genes e = # of experiments p = # of parents m = # of motifs Then we have g*e response values For each response, we have p*m parent/motif combinations For each parent/motif combination, there are three possibilities – present and up-regulated, present and down-regulated, or all those other possibilities where nothing happens Represent these possibilities as a pair of binary variables, one for up and one for down

M1, P1 + M1, P1 - M1, P2 + M1, P2 - …M1, Pp + …Mm, Pp - G1, E1 M1, P1 + M1, P1 - M1, P2 + M1, P2 - …M1, Pp + …Mm, Pp - G1, E2 M1, P1 + M1, P1 - M1, P2 + M1, P2 - …M1, Pp + …Mm, Pp - … M1, P1 + M1, P1 - M1, P2 + M1, P2 - …M1, Pp + …Mm, Pp - G1, Ee M1, P1 + M1, P1 - M1, P2 + M1, P2 - …M1, Pp + …Mm, Pp - … M1, P1 + M1, P1 - M1, P2 + M1, P2 - …M1, Pp + …Mm, Pp - Gg, Ee

Some Numbers In the paper, the initial dataset had 6110 genes and 173 experiments 354 motifs are considered 475 regulators are considered Set of genes to consider is reduced to only 1411 genes of interest

Some More Numbers Only train on genes that are up- or down- regulated Approx. 8% of gene/experiment pairs from the overall sample appear to be, so, assuming this holds true in the reduced sample, we have 19,632 gene / experiment pairs to train on For each of these values we have 2*354*475 = 336,300 predictor variables

Some Problems 19,632 by 336,300 is an awfully large matrix to want to do any calculations with We have far more variables than observations

Possible Solutions Random Projection: –Pro: we can reduce our dimensionality –Con: it seems like a somewhat silly approach –Con: there’ll still be a lot of calculations just to make the projection

Possible Solutions Variable Selection or PCA –Pro: we could reduce our dimensionality in a more informed way –Con: computationally painful

Possible Solutions Random Forest From Breiman and Cutler: Grow a number of classification trees, and take the vote of the classifications of all trees For each tree, if we have n cases, sample, with replacement, n cases, using some number of randomly chosen variables much smaller than the full number of variables –Pro: allows for flexibility in reduction of dimensions being considered –Pro: dimension is reduced without computation problems –Con: ?

Their Solution Alternating decision tree: A tree with alternating levels of prediction nodes and splitter nodes Combines a set of weak prediction rules (the splitter nodes) to make one strong rule

Alternating Decision Trees v. Random Forests Both the same rough concept – take a vote from many weak rules to get a strong rule In ADTs, rules are based on single variables, and may be conditioned on values of other variables In random forests, rules can be based on multiple variables, and are only marginal over all values of other variables ADTs are fairly straightforward to read and interpret

Alternating Decision Trees Pro: computationally kind Pro: works better than Random Forests (Creamer & Freund, 2004) Con: we didn’t come up with it

Other Ideas?

Predicting Genetic Regulatory Response Using Classification Us v. Them (“Them” being Manuel Middendorf, Anshul Kundaje, Chris Wiggins, Yoav Freund, and.

Similar presentations

Presentation on theme: "Predicting Genetic Regulatory Response Using Classification Us v. Them (“Them” being Manuel Middendorf, Anshul Kundaje, Chris Wiggins, Yoav Freund, and."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Predicting Genetic Regulatory Response Using Classification Us v. Them (“Them” being Manuel Middendorf, Anshul Kundaje, Chris Wiggins, Yoav Freund, and.

Similar presentations

Presentation on theme: "Predicting Genetic Regulatory Response Using Classification Us v. Them (“Them” being Manuel Middendorf, Anshul Kundaje, Chris Wiggins, Yoav Freund, and."— Presentation transcript:

Similar presentations

About project

Feedback