1 Weighing Evidence in the Absence of a Gold Standard
Phil Long, Genome Institute of Singapore
(joint work with K.R.K. “Krish” Murthy, Vinsensius Vega, Nir Friedman, and Edison Liu)

2 Problem: Ortholog mapping
- Pair genes in one organism with their equivalent counterparts in another
- Useful for supporting medical research using animal models

3 A little molecular biology
- DNA consists of nucleotides (A, C, T and G) arranged linearly along chromosomes
- Regions of DNA, called genes, encode proteins
- Proteins are biochemical workhorses
- Proteins are made up of amino acids
  - also strung together linearly
  - fold up to form a 3D structure

4 Mutations and evolution
- Speciation often proceeds roughly as follows:
  - one species is separated into two populations
  - the separate populations' genomes drift apart through mutation
  - important parts (e.g. genes) drift less
- Orthologs have a common evolutionary ancestor
- Genes are sometimes copied:
  - the original retains its function
  - the copy drifts or dies out
- Both fine-grained and coarse-grained mutations occur

5 Evidence of orthology
- (protein) sequence similarity
- comparison with a third organism
- conservation of synteny
- ...

6 Conserved synteny
- Neighbor relationships are often preserved
- Consequently, similarity among their neighbors is evidence that a pair of genes are orthologs

7 Plan
- Identify numerical features corresponding to:
  - sequence similarity
  - common similarity to a third organism
  - conservation of synteny
- “Learn” mapping from feature values to prediction

8 Problem – no “gold standard”
- for mouse-human orthology, the Jackson database is reasonable
- but for human-zebrafish? human-pombe?

9 Another “no gold standard” problem: protein-protein interactions
- Sources of evidence: Yeast two-hybrid, Rosetta Stone, Phage display, ...
- All yield errors

10 Related Theoretical Work [MV95] – Problem
- Goal: given m training examples generated as below, output an accurate classifier h
- Training example generation:
  - all variables are {0,1}-valued
  - Y chosen randomly, then fixed
  - X_1,...,X_n chosen independently with Pr(X_i = Y) = p_i, where p_i is unknown and the same whether Y is 0 or 1 (crucial for the analysis)
  - only X_1,...,X_n are given to the training algorithm
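A minimal sketch of this example-generation process (the function name and the accuracy values in the example call are hypothetical, not from the talk):

```python
import numpy as np

def generate_mv95_examples(m, p, rng=np.random.default_rng(0)):
    """Generate m examples from an [MV95]-style source.

    p[i] = Pr(X_i = Y), the same accuracy whether Y is 0 or 1.
    Only the X's would be shown to the learner; Y stays hidden.
    """
    p = np.asarray(p)
    y = rng.integers(0, 2, size=m)                   # hidden label, uniform over {0, 1}
    agree = rng.random((m, len(p))) < p              # does X_i agree with Y?
    x = np.where(agree, y[:, None], 1 - y[:, None])  # X_i = Y when it agrees, else flipped
    return x, y

# Example with three noisy voters of (hypothetical) unknown reliabilities.
X, Y = generate_mv95_examples(m=1000, p=[0.9, 0.7, 0.6])
```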

11 Related Theoretical Work [MV95] – Results
- If n ≥ 3, can approach the Bayes error (best possible for the source) as m gets large
- Idea: a variable is “good” if it often agrees with the others
  - can, e.g., solve for Pr(X_1 = Y) as a function of Pr(X_1 = X_2), Pr(X_1 = X_3), and Pr(X_2 = X_3) (the algebra is sketched below)
  - can estimate Pr(X_1 = X_2), Pr(X_1 = X_3), and Pr(X_2 = X_3) from the training data
  - can plug in to get estimates of Pr(X_1 = Y),...,Pr(X_n = Y)
  - can use the resulting estimates of Pr(X_1 = Y),...,Pr(X_n = Y) to approximate the optimal classifier for the source
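One way to see why pairwise agreement rates pin down the individual accuracies, writing $q_i = \Pr(X_i = Y)$ and using the conditional independence of the $X_i$ given $Y$:

$$
\Pr(X_i = X_j) = q_i q_j + (1-q_i)(1-q_j),
\qquad
a_{ij} := 2\Pr(X_i = X_j) - 1 = (2q_i - 1)(2q_j - 1),
$$
so
$$
(2q_1 - 1)^2 = \frac{a_{12}\,a_{13}}{a_{23}}
\quad\Longrightarrow\quad
q_1 = \frac{1 + \sqrt{a_{12}a_{13}/a_{23}}}{2}
$$
(taking each variable to be better than random, i.e. $q_i > 1/2$), and each $a_{ij}$ can be estimated from the training data simply by counting agreements.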

12 In our problem(s)...
- Pr(Y = 1) is small
- X_1,...,X_n are continuous-valued
- Reasonable to assume X_1,...,X_n are conditionally independent given Y
- Reasonable to assume Pr(Y = 1 | X_i = x) is increasing in x, for all i
- Sufficient to sort the training examples in order of the associated conditional probabilities that Y = 1

13 Key Idea
- Suppose Pr(Y = 1) is known
- For each variable i, set a threshold so that Pr(U_i = 1) = Pr(Y = 1)
- Then Pr(Y = 1 and U_i = 0) = Pr(Y = 0 and U_i = 1)
- Can solve for these error probabilities, for all i, in terms of the probabilities that the U_i's agree, ...
[Figure: examples laid out along the X_i axis, mostly negatives (-) with the positives (+) concentrated to the right; the threshold splits them into U_i = 0 (left) and U_i = 1 (right)]
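Why matching the threshold balances the two error masses (a one-line check using only the definitions on this slide):

$$
\Pr(Y{=}1, U_i{=}0) = \Pr(Y{=}1) - \Pr(Y{=}1, U_i{=}1)
= \Pr(U_i{=}1) - \Pr(Y{=}1, U_i{=}1)
= \Pr(Y{=}0, U_i{=}1),
$$
where the middle step uses $\Pr(U_i = 1) = \Pr(Y = 1)$. With this symmetry, the discretized variables $U_i$ behave like the symmetric-noise votes of [MV95], so the same agreement-based accounting applies.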

14 Final Plan (informal)
- Assume various values of Pr(Y = 1); predict orthologs under each assumption
- For pairs of genes predicted to be orthologs even when Pr(Y = 1) is assumed small, confidently predict orthology
- For pairs of genes predicted to be orthologs only when Pr(Y = 1) is assumed pretty big, predict orthology more tentatively

15 Final Plan – Probabilistic Viewpoint
- Consider a hidden variable Z:
  - takes values uniformly distributed in [0,1]
  - interpretation: “obviously orthologous”
- Assumptions:
  - Pr(Y = 1 | Z = z) is increasing in z
  - for all z, Pr(Z ≥ z | X_i = x) is increasing in x
- For various z:
  - let V_z = 1 if Z ≥ z, V_z = 0 otherwise
  - let U_{z,i} = 1 if X_i ≥ θ_{z,i}, U_{z,i} = 0 otherwise, where θ_{z,i} is chosen so that Pr(U_{z,i} = 1) = Pr(V_z = 1)
- Interpretations:
  - V_z: “in the top 100(1−z)% most likely to have Y = 1, overall”
  - U_{z,i}: “in the top 100(1−z)% most likely to have Y = 1, given X_i”
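Because Z is uniform on [0,1], the matching condition determines each threshold as a quantile (a small spelled-out consequence of the definitions above):

$$
\Pr(V_z = 1) = \Pr(Z \ge z) = 1 - z,
\qquad\text{so } \theta_{z,i} \text{ satisfies } \Pr(X_i \ge \theta_{z,i}) = 1 - z,
$$
i.e. $\theta_{z,i}$ is the $z$-quantile of $X_i$ (in practice, the empirical $z$-quantile over the training examples).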

16 Final Plan – Algorithm
- Estimate the conditional probability that V_z = 1 (i.e. that Z ≥ z) given each training example, using the estimated probabilities that pairs of U_{z,i}'s agree (one way to write this score is given below)
- Add these estimates over z to estimate Z for each example; sort by the estimates
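Treating the U_{z,i} as conditionally independent given V_z (the analogue of the assumption on slide 12), one way to write the per-z score is the naive-Bayes posterior below; here $\pi_z = \Pr(V_z = 1) = 1 - z$ and $b_{z,i} = \Pr(U_{z,i} \ne V_z)/2$ is the common joint error mass from slide 13's balance argument:

$$
\Pr(V_z{=}1 \mid U_{z,1},\dots,U_{z,n})
= \frac{\pi_z \prod_{i} \Pr(U_{z,i} \mid V_z{=}1)}
       {\pi_z \prod_{i} \Pr(U_{z,i} \mid V_z{=}1) + (1-\pi_z) \prod_{i} \Pr(U_{z,i} \mid V_z{=}0)},
$$
with
$$
\Pr(U_{z,i}{=}0 \mid V_z{=}1) = \frac{b_{z,i}}{\pi_z},
\qquad
\Pr(U_{z,i}{=}1 \mid V_z{=}0) = \frac{b_{z,i}}{1-\pi_z}.
$$
Since $Z$ is uniform on $[0,1]$, $\mathbb{E}[Z \mid \text{features}] = \int_0^1 \Pr(Z \ge z \mid \text{features})\,dz$, so summing (or averaging) these posteriors over a grid of $z$ values gives the estimate of $Z$ by which the examples are sorted.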

17 Practical problem
- Small errors in the estimates of the Pr(U_{z,i} = U_{z,j})'s can lead to large errors in the estimates of the Pr(U_{z,i} = V_z)'s (in fact, the program crashes)
- Solution: since small Pr(V_z = 1) is the important case (confident predictions), we can approximate
  Pr(U_{z,i} ≠ V_z) ≈ ½ (Pr(U_{z,i} ≠ U_{z,j}) + Pr(U_{z,i} ≠ U_{z,k}) − Pr(U_{z,j} ≠ U_{z,k}))
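A compact end-to-end sketch of the resulting procedure as I read it off slides 15-17 (my own reconstruction, not the authors' code; the function name, the grid of z values, the averaging over (j, k) pairs, and the clipping are all assumptions of this sketch):

```python
import numpy as np
from itertools import combinations

def rank_by_estimated_z(X, z_grid=np.linspace(0.5, 0.99, 50)):
    """Score each example by an estimate of E[Z | features]: the average, over a grid
    of z, of the estimated Pr(V_z = 1 | U_{z,1},...,U_{z,n}).  Needs >= 3 features."""
    m, n = X.shape
    scores = np.zeros(m)
    for z in z_grid:
        pi = 1.0 - z                                       # Pr(V_z = 1) = Pr(Z >= z)
        U = (X >= np.quantile(X, z, axis=0)).astype(int)   # U_{z,i}: top 100*(1-z)% of X_i

        # d[i, j]: empirical Pr(U_{z,i} != U_{z,j}).
        d = np.mean(U[:, :, None] != U[:, None, :], axis=0)

        # Triple approximation from the slide for e_i = Pr(U_{z,i} != V_z),
        # averaged over all choices of (j, k) (the averaging is my own choice).
        e = np.empty(n)
        for i in range(n):
            others = [j for j in range(n) if j != i]
            vals = [0.5 * (d[i, j] + d[i, k] - d[j, k])
                    for j, k in combinations(others, 2)]
            e[i] = np.clip(np.mean(vals), 1e-6, 2 * pi * (1 - pi))  # keep estimates feasible

        b = e / 2.0        # joint error mass Pr(U_{z,i}=0, V_z=1) = Pr(U_{z,i}=1, V_z=0)
        # Naive-Bayes posterior for V_z = 1 (conditional independence assumed), in log space.
        log_pos = np.log(pi) + np.sum(
            np.where(U == 1, np.log(1 - b / pi), np.log(b / pi)), axis=1)
        log_neg = np.log(1 - pi) + np.sum(
            np.where(U == 1, np.log(b / (1 - pi)), np.log(1 - b / (1 - pi))), axis=1)
        scores += np.exp(log_pos - np.logaddexp(log_pos, log_neg))
    return scores / len(z_grid)   # averaging over z tracks E[Z | features]; used for ranking
```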

18 Evaluation: Artificial Source
- Examples generated using a randomly chosen probability distribution:
  - Pr(Y = 1) = 0.1, n = 5
  - for each i, choose μ_i uniformly from [min, max]
  - set the distributions for the i-th variable: X_i | Y=0 ~ N(−μ_i, 1), X_i | Y=1 ~ N(μ_i, 1)
- Evaluate using the area under the ROC curve
- Repeat 100 times and average
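A minimal sketch of this artificial source (the function and parameter names are mine; the default [0.1, 0.5] range is just one row of the results table two slides below):

```python
import numpy as np

def generate_artificial_source(m, n=5, mu_min=0.1, mu_max=0.5, p_pos=0.1,
                               rng=np.random.default_rng(0)):
    """Pr(Y = 1) = p_pos; mu_i ~ Uniform[mu_min, mu_max];
    X_i | Y=0 ~ N(-mu_i, 1) and X_i | Y=1 ~ N(+mu_i, 1), independently given Y."""
    mu = rng.uniform(mu_min, mu_max, size=n)
    y = (rng.random(m) < p_pos).astype(int)
    signs = np.where(y[:, None] == 1, 1.0, -1.0)   # +mu_i for positives, -mu_i for negatives
    return rng.normal(loc=signs * mu, scale=1.0), y
```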

19 ROC curve
[Figure: ROC curve plotting true positives against false positives, both axes running from 0 to 1; the shaded region is the area under the ROC curve]
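Equivalently, the area under the ROC curve is the probability that a randomly chosen positive example is ranked above a randomly chosen negative one (ties counted as one half); a small sketch of computing it directly from scores (quadratic in the number of examples, written for clarity rather than speed):

```python
import numpy as np

def auc(scores, labels):
    """Area under the ROC curve: Pr(a random positive outscores a random negative),
    with ties counted as 1/2."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    diffs = pos[:, None] - neg[None, :]            # all positive-vs-negative comparisons
    return ((diffs > 0).sum() + 0.5 * (diffs == 0).sum()) / (len(pos) * len(neg))
```

With the two sketches above, the experiment on the previous slide roughly amounts to averaging auc(rank_by_estimated_z(X), Y) over 100 independent draws of (X, Y).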

20 Results: Artificial Source

    m      min μ   max μ   peer AUC   opt (w/ Y's)
    1000   0.2     1.0     .940       .985
    1000   0.1     0.5     .811       .881
    1000   0.05    0.25    .635       .818
    1000   0.02    0.1     .611       .753

21 Evaluation: mouse-human ortholog mapping
- Use the Jackson mouse-human ortholog database as the “gold standard”
- Apply the algorithm, post-processing to map each gene to a unique ortholog
- Compare with an analogous BLAST-only algorithm
- Plot ROC curves
- Treat anything not in the database as a non-ortholog
  - some “false positives” are in fact correct, so the error rate is overestimated

22 Results: mouse-human ortholog mapping

23 Open problems
- Given our assumptions, is there an algorithm for learning from random examples that always approaches the optimal AUC given knowledge of the source?
- Is discretizing the independent variables necessary?
- How does our method compare with other natural algorithms? (E.g. what about algorithms based on clustering?)

