Presentation on theme: "Two bioinformatics applications of dynamic Bayesian networks"— Presentation transcript:
1 Two bioinformatics applications of dynamic Bayesian networks. William Stafford Noble, Department of Genome Sciences, Department of Computer Science and Engineering, University of Washington.
2 Outline. Segmenting genomic data: background (DNA, chromatin and DNase I); simple solution; wavelets; hierarchical model. Matching peptides to mass spectra: background (tandem mass spectrometry); modeling peptide fragmentation.
3 The human genome in vivo. [Figure labels: chromatin fiber; gene 'domains'; nucleus; trans-factor complex; DNase I hypersensitive site; genes; genomic DNA packaged into chromatin.]
6 A (very) simple hidden Markov model. Two states: open chromatin and closed chromatin. Each state contains a single Gaussian. The model has six parameters (two transitions, two means, two standard deviations). The parameters are initialized randomly and trained in an unsupervised fashion via expectation-maximization. EM is restarted 100 times, and we select the parameters that yield the highest likelihood. The original data set is then segmented using either Viterbi or posterior decoding.
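The decoding step on this slide can be illustrated with a minimal sketch of Viterbi decoding for a two-state Gaussian HMM. This is not the speaker's implementation: the function names, the uniform initial distribution, the state labels (0 = closed, 1 = open) and the parameter values are all assumptions made for illustration, and the parameters are taken as given rather than trained by EM.

```python
import math


def gaussian_logpdf(x, mean, sd):
    """Log density of a univariate Gaussian."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def viterbi_two_state(obs, stay, means, sds):
    """Most likely state path for a two-state Gaussian HMM.

    stay[i] is the probability of remaining in state i, so the six
    parameters are two transitions, two means and two standard deviations.
    The initial distribution is assumed uniform (an illustrative choice).
    """
    log_trans = [
        [math.log(stay[0]), math.log(1 - stay[0])],
        [math.log(1 - stay[1]), math.log(stay[1])],
    ]
    # Scores for the first observation.
    score = [math.log(0.5) + gaussian_logpdf(obs[0], means[s], sds[s]) for s in (0, 1)]
    backpointers = []
    for x in obs[1:]:
        new_score, ptr = [], []
        for s in (0, 1):
            # Best predecessor of state s at this position.
            prev = max((0, 1), key=lambda p: score[p] + log_trans[p][s])
            ptr.append(prev)
            new_score.append(
                score[prev] + log_trans[prev][s] + gaussian_logpdf(x, means[s], sds[s])
            )
        score = new_score
        backpointers.append(ptr)
    # Trace back from the best final state.
    state = max((0, 1), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical signal: low values, a short high-valued stretch, then low again.
signal = [0.1, -0.2, 0.3, 5.1, 4.8, 5.3, 0.0, 0.2]
segmentation = viterbi_two_state(signal, stay=(0.9, 0.9), means=(0.0, 5.0), sds=(1.0, 1.0))
print(segmentation)
```

With these made-up parameters the high-valued stretch is labeled as state 1 and the rest as state 0. Posterior decoding, the alternative named on the slide, would instead use the forward-backward algorithm to pick the most probable state at each position independently.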
8 A problem, and two solutions. Problem: We are interested in phenomena occurring at multiple scales. Solution #1: Perform a wavelet smooth prior to HMM analysis. Solution #2: Build a more complex probability model.
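Solution #1 can be sketched as follows. The slide does not say which wavelet was used, so the Haar wavelet here is an assumption, as are the function name and the use of plain pairwise averages instead of the orthonormal scaling. Discarding the detail coefficients of the finest `levels` scales and reconstructing is equivalent, for Haar, to replacing each block of 2**levels samples by its mean.

```python
def haar_smooth(signal, levels):
    """Smooth a signal by dropping Haar detail coefficients at the finest scales.

    Assumes len(signal) is divisible by 2**levels (an illustrative restriction).
    """
    if levels < 0 or len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    approx = [float(v) for v in signal]
    # Forward transform: keep only the pairwise averages at each level.
    for _level in range(levels):
        approx = [(approx[i] + approx[i + 1]) / 2 for i in range(0, len(approx), 2)]
    # Inverse transform with all details set to zero: repeat each coefficient.
    for _level in range(levels):
        approx = [a for a in approx for _copy in range(2)]
    return approx


# Hypothetical data; a larger `levels` gives a coarser-scale view.
raw = [1, 3, 5, 7, 2, 2, 8, 0]
print(haar_smooth(raw, 1))
print(haar_smooth(raw, 2))
```

Running the HMM on the output for several values of `levels` is one way to obtain segmentations at several scales.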
13 Change point model. Four-state model: major DNase hypersensitive site (DHS), minor DHS, intermediate sensitivity region, and insensitive region. Continuous mixture of Gaussians at each state. Gamma distribution of lengths within each region.
21 Future directions. Many types of genomic data: phylogenetic conservation scores, various histone modifications, replication timing, etc. Perform segmentations in multiple dimensions simultaneously. Assign statistical significance to observed segments.
22 Shotgun proteomics. [Workflow diagram labels: training PSMs; test PSMs; probability model; trained model; evaluation.] PSM = peptide-spectrum match.
24 Bayesian network. We model peptide fragmentation using a Bayesian network. Nodes represent random variables, and edges represent conditional dependencies. Each node stores a conditional probability table (CPT) giving Pr(node|parents). Example: the node "Is b-ion observed?" is the parent of the node "b-ion intensity". CPT for b-ion intensity (columns as listed: intensity > 50%, intensity < 50%): no b-ion observed: 1.00, 0.00; b-ion observed: 0.75, 0.25.
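A minimal sketch of how such a two-node network can be stored and queried. The CPT rows use the values from the slide, but the assignment of each value to a column follows the order in which the transcript lists them, which may not match the original slide layout. The prior on "Is b-ion observed?" is not given in the talk and is a made-up number here, as are all function and variable names.

```python
# Column order as listed in the transcript (assumption about the slide layout).
COLUMNS = ("intensity > 50%", "intensity < 50%")

# CPT for the child node: Pr(b-ion intensity | Is b-ion observed?).
CPT_INTENSITY = {
    False: (1.00, 0.00),  # no b-ion observed
    True: (0.75, 0.25),   # b-ion observed
}

# Hypothetical prior for the parent node; not taken from the talk.
P_OBSERVED = 0.6


def joint(observed, column):
    """Pr(observed, intensity column) = Pr(parent) * Pr(child | parent)."""
    prior = P_OBSERVED if observed else 1.0 - P_OBSERVED
    return prior * CPT_INTENSITY[observed][column]


def marginal(column):
    """Pr(intensity column), summing the joint over the parent's values."""
    return sum(joint(observed, column) for observed in (False, True))


for index, label in enumerate(COLUMNS):
    print(label, round(marginal(index), 4))
```

Each row of a CPT must sum to one, and the joint distribution of the whole network is the product of one such table entry per node.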
25 Ion series modeled in a Markov chain. [Figure: a chain of five "Is b-ion observed?" nodes, each with a "b-ion intensity" node.] Similar to PepHMM (Han et al., 2005).
26 A more realistic model. [Figure nodes: Is b-ion observed?; b-ion intensity; N-term AA; C-term AA; Is ion detectable?; Fractional m/z; Is proton mobile?]
30 Model Evaluation: Accuracy. [Workflow diagram labels: training PSMs; test PSMs; probability model; trained model; evaluation.] Results by model (redundant TP/FP; unique TP/FP): Bayes net (285/300, 95%; 137/144, 95.1%); SEQUEST (288/300, 96%; 136/144, 94.4%); InsPecT (274/300, 91.3%; 131/144, 90.9%).
31 An incorrect identification. Bayes net: HQDETQDALNALDLLTNEK. SEQUEST: LRPGAELLEGAHVGNFVEMK. This peptide does not appear in E. coli, the organism from which this protein sample was derived. Blue = b and y, green = a, red = ammonia loss, magenta = water loss, sienna = +2.
32 Co-eluting peptides. SEQUEST: AFPEAVLFIHPLDAK. Bayes net: DVFVHFSALQGNQFK. Blue = b and y, green = a, red = ammonia loss, magenta = water loss, sienna = +2.
33 Future directions. Build a single Bayesian network that includes all ion types. Produce more descriptive outputs from the Bayesian network for input to the classifier. Add more biophysical details to the model: chromatography retention time, a better mass-to-charge estimate, etc. Generate a better (larger, more accurate) gold standard data set.
34 Acknowledgments. DNase I hypersensitivity: John Stamatoyannopoulos, Pete Sabo, Scott Kuehn, many others in the Stam lab. Wavelet analysis: Bob Thurman. Change point model: Charles Lawrence, Heng Lian, William Thompson. Mass spectrometry: Aaron Klammer, Jeff Bilmes, Sheila Reynolds, Michael MacCoss.