Weighing Evidence in the Absence of a Gold Standard Phil Long Genome Institute of Singapore (joint work with K.R.K. “Krish” Murthy, Vinsensius Vega, Nir.

Presentation transcript:

Weighing Evidence in the Absence of a Gold Standard Phil Long Genome Institute of Singapore (joint work with K.R.K. “Krish” Murthy, Vinsensius Vega, Nir Friedman and Edison Liu.)

Problem: Ortholog mapping
- Pair genes in one organism with their equivalent counterparts in another
- Useful for supporting medical research using animal models

A little molecular biology
- DNA has nucleotides (A, C, T and G) arranged linearly along chromosomes
- Regions of DNA, called genes, encode proteins
- Proteins: biochemical workhorses
- Proteins are made up of amino acids, also strung together linearly, and fold up to form a 3D structure

Mutations and evolution
- Speciation often roughly as follows:
  - one species separated into two populations
  - separate populations’ genomes drift apart through mutation
  - important parts (e.g. genes) drift less
- Orthologs have common evolutionary ancestor
- Genes sometimes copied
  - original retains function
  - copy drifts or dies out
- Both fine-grained and coarse-grained mutations

Evidence of orthology
- (protein) sequence similarity
- comparison with third organism
- conservation of synteny
- ...

Conserved synteny
- Neighbor relationships often preserved
- Consequently, similarity among their neighbors is evidence that a pair of genes are orthologs

Plan
- Identify numerical features corresponding to
  - sequence similarity
  - common similarity to third organism
  - conservation of synteny
- “Learn” mapping from feature values to prediction

Problem – no “gold standard”
- for mouse-human orthology, Jackson database reasonable
- for human-zebrafish? human-pombe?

Another “no gold standard” problem: protein-protein interactions
- Sources of evidence:
  - Yeast two-hybrid
  - Rosetta Stone
  - Phage display
  - ...
- All yield errors

Related Theoretical Work [MV95] – Problem
- Goal: given m training examples generated as below, output accurate classifier h
- Training example generation:
  - All variables {0,1}-valued
  - Y chosen randomly, fixed
  - X_1,...,X_n chosen independently with Pr(X_i = Y) = p_i, where p_i is unknown and is the same when Y is 0 or 1 (crucial for analysis)
  - only X_1,...,X_n given to training algorithm

Related Theoretical Work [MV95] – Results
- If n ≥ 3, can approach Bayes error (best possible for source) as m gets large
- Idea: variable “good” if often agrees with others
  - can e.g. solve for Pr(X_1 = Y) as function of Pr(X_1 = X_2), Pr(X_1 = X_3), and Pr(X_2 = X_3)
  - can estimate Pr(X_1 = X_2), Pr(X_1 = X_3), and Pr(X_2 = X_3) from the training data
  - can plug in to get estimates of Pr(X_1 = Y),...,Pr(X_n = Y)
  - can use resulting estimates of Pr(X_1 = Y),...,Pr(X_n = Y) to approximate optimal classifier for source
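The "solve for Pr(X_1 = Y)" step can be made concrete. Under the stated model (variables independent given Y, with accuracy p_i that does not depend on Y), Pr(X_i = X_j) = p_i p_j + (1 - p_i)(1 - p_j), which rearranges to 2 Pr(X_i = X_j) - 1 = (2 p_i - 1)(2 p_j - 1). The sketch below uses that identity; it is an illustration, not the code from [MV95], and it additionally assumes every p_i > 1/2 so that the positive square root is the right one. The function names and the simulated accuracies are hypothetical.

```python
import math
import random


def agreement_rate(a, b):
    """Fraction of positions where two 0/1 lists agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def estimate_accuracy(x1, x2, x3):
    """Estimate Pr(X1 = Y) from pairwise agreement rates only (Y is never seen).

    With c_ij = 2*Pr(Xi = Xj) - 1 = (2*p_i - 1)*(2*p_j - 1), we get
    (2*p_1 - 1)**2 = c_12 * c_13 / c_23.
    Assumption (not from the slides): all p_i > 1/2.
    """
    c12 = 2 * agreement_rate(x1, x2) - 1
    c13 = 2 * agreement_rate(x1, x3) - 1
    c23 = 2 * agreement_rate(x2, x3) - 1
    return (1 + math.sqrt(c12 * c13 / c23)) / 2


# Hypothetical simulation: three noisy copies of a hidden label Y.
rng = random.Random(0)
true_p = [0.9, 0.8, 0.7]  # illustrative accuracies
m = 100_000
ys = [rng.randint(0, 1) for _ in range(m)]
xs = [[y if rng.random() < p else 1 - y for y in ys] for p in true_p]

p1_hat = estimate_accuracy(xs[0], xs[1], xs[2])
print(f"estimated Pr(X1 = Y): {p1_hat:.3f} (true value {true_p[0]})")
```

Note that the estimate divides by c_23 and takes a square root, so sampling noise in the agreement rates can be amplified when c_23 is close to zero.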

In our problem(s)...
- Pr(Y = 1) small
- X_1,...,X_n continuous-valued
- Reasonable to assume X_1,...,X_n conditionally independent given Y
- Reasonable to assume Pr(Y = 1 | X_i = x) increasing in x, for all i
- Sufficient to sort training examples in order of associated conditional probabilities that Y = 1

Key Idea
- Suppose Pr(Y = 1) known
- For variable i, let U_i = 1 if X_i is at or above a threshold, U_i = 0 otherwise
- Set threshold so that Pr(U_i = 1) = Pr(Y = 1)
- Then Pr(Y = 1 and U_i = 0) = Pr(Y = 0 and U_i = 1)
- Can solve for these error probabilities for all i in terms of probabilities that U_i’s agree
[Figure labels: U_i = 1, U_i = 0]
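The equality of the two error probabilities follows from splitting each marginal over the other variable. A short derivation, using only the threshold condition stated on the slide:

```latex
\begin{align*}
\Pr(U_i = 1) &= \Pr(Y = 1,\, U_i = 1) + \Pr(Y = 0,\, U_i = 1) \\
\Pr(Y = 1)   &= \Pr(Y = 1,\, U_i = 1) + \Pr(Y = 1,\, U_i = 0)
\end{align*}
% The threshold is chosen so that the left-hand sides are equal.
% Subtracting the shared term \Pr(Y = 1, U_i = 1) from both right-hand sides:
\[
\Pr(Y = 0,\, U_i = 1) = \Pr(Y = 1,\, U_i = 0)
\]
```

So each variable has a single unknown error mass (false positives equal false negatives), which is what makes it possible to solve for it from agreement probabilities.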

Final Plan (informal)
- Assume various values of Pr(Y = 1); predict orthologs given each
- For pairs of genes predicted to be orthologs even when Pr(Y = 1) assumed small, confidently predict orthology
- For pairs of genes predicted to be orthologs only when Pr(Y = 1) assumed pretty big, predict orthology more tentatively

Final Plan – Probabilistic Viewpoint
- Consider hidden variable Z:
  - takes values uniformly distributed in [0,1]
  - interpretation: “obviously orthologous”
- Assumptions
  - Pr(Y = 1 | Z = z) increasing in z
  - For all z, Pr(Z ≥ z | X_i = x) increasing in x
- For various z
  - Let V_z = 1 if Z ≥ z, V_z = 0 otherwise
  - Let U_{z,i} = 1 if X_i ≥ θ_{z,i}, U_{z,i} = 0 otherwise, where θ_{z,i} is chosen so that Pr(U_{z,i} = 1) = Pr(V_z = 1)
- Interpretations:
  - V_z is “In the top 100(1 - z)% most likely to have Y = 1 overall”
  - U_{z,i} is “In the top 100(1 - z)% most likely to have Y = 1 given X_i”
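Because Z is uniform on [0,1], Pr(V_z = 1) = Pr(Z ≥ z) = 1 - z, so the threshold for variable i at level z is the point that leaves a fraction 1 - z of that variable's values at or above it. A minimal sketch of this binarization on a finite sample follows; the function name `binarize_at_level` and the sample values are hypothetical, and using the empirical quantile as the threshold is an assumption about how one would do this in practice.

```python
def binarize_at_level(values, z):
    """Return (theta, u) with u[k] = 1 iff values[k] >= theta.

    theta is chosen empirically so that about a fraction (1 - z) of the
    examples get u = 1, mirroring Pr(U_{z,i} = 1) = Pr(V_z = 1) = 1 - z.
    Ties at the threshold can push the fraction slightly above 1 - z.
    """
    ordered = sorted(values)
    n = len(ordered)
    num_ones = round((1 - z) * n)  # how many examples should get u = 1
    if num_ones == 0:
        theta = float("inf")       # nothing passes the threshold
    else:
        theta = ordered[n - num_ones]
    return theta, [1 if v >= theta else 0 for v in values]


# Illustrative feature values for one variable X_i (made up).
x_i = [0.3, 2.1, -1.0, 0.9, 1.5, -0.2, 0.0, 3.3, 0.7, 1.1]
theta, u = binarize_at_level(x_i, z=0.8)
print(theta, u)
```

With z = 0.8 and ten examples, the two largest values are marked 1, matching the "top 100(1 - z)%" reading of U_{z,i}.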

Final Plan - Algorithm
- Estimate conditional probability that V_z = 1, i.e. that Z ≥ z, given each training example, using estimated probabilities that pairs of U_{z,i}’s agree
- Add to estimate Z’s; sort by estimates
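One way to read the "add" step, offered as an interpretation and not confirmed by the slides: since Z lies in [0,1], its conditional mean is the integral of its conditional tail probabilities, so summing the estimates of Pr(V_z = 1 | x) over a grid of z values approximates an estimate of Z. The grid z_1,...,z_K below is a hypothetical choice.

```latex
\[
\mathbb{E}[Z \mid x]
  = \int_0^1 \Pr(Z \ge z \mid x)\, dz
  = \int_0^1 \Pr(V_z = 1 \mid x)\, dz
  \approx \frac{1}{K} \sum_{k=1}^{K} \Pr(V_{z_k} = 1 \mid x)
\]
% Assumption: z_1, ..., z_K are evenly spaced in [0, 1].
```

Sorting by the sum is the same as sorting by this average, because the factor 1/K does not change the order.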

Practical problem
- Small errors in estimates of the Pr(U_{z,i} = U_{z,j})’s can lead to large errors in estimates of the Pr(U_{z,i} = V_z)’s (in fact, the program crashes).
- Solution: the important case is when Pr(V_z = 1) is small (confident predictions); in that case we can approximate:
  Pr(U_{z,i} ≠ V_z) ≈ ½ (Pr(U_{z,i} ≠ U_{z,j}) + Pr(U_{z,i} ≠ U_{z,k}) - Pr(U_{z,j} ≠ U_{z,k}))
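The approximation can be checked numerically. When two variables rarely err on the same example, Pr(U_i ≠ U_j) is close to e_i + e_j, where e_i = Pr(U_i ≠ V); substituting into the formula, the e_j and e_k terms cancel and e_i remains. The sketch below assumes, purely for illustration, that each variable differs from V independently with a fixed probability; that simplified error model and the numbers are not from the slides.

```python
def approx_error(d_ij, d_ik, d_jk):
    """Approximate Pr(U_i != V) from three pairwise disagreement rates."""
    return 0.5 * (d_ij + d_ik - d_jk)


def disagreement(e_a, e_b):
    """Pr(U_a != U_b) under the illustrative assumption that each variable
    differs from V independently, with probabilities e_a and e_b."""
    return e_a * (1 - e_b) + e_b * (1 - e_a)


# Hypothetical small error rates for three variables i, j, k.
e_i, e_j, e_k = 0.01, 0.02, 0.03
est_i = approx_error(
    disagreement(e_i, e_j),
    disagreement(e_i, e_k),
    disagreement(e_j, e_k),
)
print(f"true e_i = {e_i}, approximation = {est_i:.4f}")
```

The formula is linear in the estimated disagreement rates, with no division or square root, so a small error in an input produces a comparably small error in the output.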

Evaluation: Artificial Source
- Examples generated using randomly chosen probability distribution:
  - Pr(Y = 1) = 0.1, n = 5
  - For each i, choose μ_i uniformly from [min, max]
  - Set distributions for the i-th variable: Pr(X_i | Y = 0) = N(-μ_i, 1), Pr(X_i | Y = 1) = N(μ_i, 1)
- Evaluate using area under the ROC curve
- Repeat 100 times, average
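A minimal sketch of a generator for this artificial source, following the description above. The function name, the seed handling, and the default range for μ (0.5 to 1.5) are assumptions made for illustration; the transcript does not preserve the ranges that were actually used.

```python
import random


def sample_artificial_source(m, n=5, mu_min=0.5, mu_max=1.5, prior=0.1, seed=0):
    """Draw m examples with n features each.

    Y = 1 with probability `prior`; feature i is N(mu_i, 1) when Y = 1 and
    N(-mu_i, 1) when Y = 0, with mu_i drawn uniformly from [mu_min, mu_max].
    mu_min and mu_max are hypothetical defaults, not values from the slides.
    """
    rng = random.Random(seed)
    mus = [rng.uniform(mu_min, mu_max) for _ in range(n)]
    examples, labels = [], []
    for _ in range(m):
        y = 1 if rng.random() < prior else 0
        sign = 1 if y == 1 else -1
        examples.append([rng.gauss(sign * mu, 1.0) for mu in mus])
        labels.append(y)
    return mus, examples, labels


mus, examples, labels = sample_artificial_source(m=2000)
print(len(examples), len(examples[0]), sum(labels) / len(labels))
```

The labels are returned only so that a learned ranking can be scored afterwards; a learner in this setting would see `examples` alone.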

ROC curve
[Figure: ROC curve. Horizontal axis: false positives (up to 1); vertical axis: true positives (up to 1); annotation: area under the ROC curve]
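The area under the ROC curve equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one, counting ties as one half. The sketch below computes it directly from that pairwise definition; the function name `auc` and the toy scores are illustrative, and the quadratic loop is chosen for clarity over speed.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Returns the fraction of (positive, negative) pairs in which the positive
    example has the higher score; ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 3 of the 4 positive/negative pairs are ordered correctly.
print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # → 0.75
```

Because only the ordering of the scores matters, this measure suits a method whose output is a sorted list, not calibrated probabilities.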

Results: Artificial Source
[Table columns: m | min μ | max μ | peer AUC | opt (w/ Y’s)]

Evaluation: mouse-human ortholog mapping
- Use Jackson mouse-human ortholog database as “gold standard”
- Apply algorithm, post-processing to map each gene to unique ortholog
- Compare with analogous BLAST-only algorithm
- Plot ROC curve
- Treat anything not in database as non-ortholog
  - some “false positives” in fact correct
  - error rate overestimated

Results: mouse-human ortholog mapping

Open problems
- Given our assumptions, is there an algorithm for learning using random examples that always approaches the optimal AUC given knowledge of the source?
- Is discretizing the independent variables necessary?
- How does our method compare with other natural algorithms? (E.g. what about algorithms based on clustering?)