Two bioinformatics applications of dynamic Bayesian networks William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering.


1 Two bioinformatics applications of dynamic Bayesian networks William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University of Washington

2 Outline
Segmenting genomic data
–Background: DNA, chromatin and DNase I
–Simple solution
–Wavelets
–Hierarchical model
Matching peptides to mass spectra
–Background: tandem mass spectrometry
–Modeling peptide fragmentation

3 The human genome in vivo
Figure labels: Genes; Gene domains; DNase I Hypersensitive Site; Trans-factor complex; Chromatin Fiber; Nucleus; Genomic DNA Packaged into Chromatin

4 Measuring chromatin accessibility

5

6 A very simple hidden Markov model
The model has two states: open chromatin and closed chromatin. Each state contains a single Gaussian. The model has six parameters (two transitions, two means, two standard deviations). The parameters are initialized randomly and trained in an unsupervised fashion via expectation-maximization (EM). EM is restarted 100 times, and we select the parameters that yield the highest likelihood. The original data set is then segmented using either Viterbi or posterior decoding.
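The procedure on this slide can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code behind the talk: the synthetic data, the fixed uniform start distribution, the clamping constants, and the function names `em_step`, `fit`, and `viterbi` are all hypothetical, and the sketch uses 10 restarts rather than 100 to keep it fast.

```python
import math
import random

def log_gauss(x, mu, sd):
    """Log density of a Gaussian with mean mu and standard deviation sd."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mu) ** 2 / (2 * sd * sd)

def em_step(obs, stay, mu, sd):
    """One Baum-Welch update. Returns (log-likelihood of the inputs, new parameters)."""
    T = len(obs)
    A = [[stay[0], 1 - stay[0]], [1 - stay[1], stay[1]]]
    B, loglik = [], 0.0
    for x in obs:  # emissions, rescaled per time step to avoid underflow
        logs = [log_gauss(x, mu[k], sd[k]) for k in (0, 1)]
        m = max(logs)
        B.append([math.exp(v - m) for v in logs])
        loglik += m
    alpha, c = [], []
    for t in range(T):  # scaled forward pass (start distribution fixed at 0.5 / 0.5)
        row = [(0.5 if t == 0 else alpha[-1][0] * A[0][k] + alpha[-1][1] * A[1][k]) * B[t][k]
               for k in (0, 1)]
        c.append(row[0] + row[1])
        alpha.append([v / c[t] for v in row])
        loglik += math.log(c[t])
    beta = [[1.0, 1.0] for _ in range(T)]
    for t in range(T - 2, -1, -1):  # scaled backward pass
        beta[t] = [sum(A[j][k] * B[t + 1][k] * beta[t + 1][k] for k in (0, 1)) / c[t + 1]
                   for j in (0, 1)]
    gamma = [[alpha[t][k] * beta[t][k] for k in (0, 1)] for t in range(T)]
    new_stay, new_mu, new_sd = [], [], []
    for k in (0, 1):
        weight = sum(g[k] for g in gamma) + 1e-12
        mean = sum(g[k] * x for g, x in zip(gamma, obs)) / weight
        var = sum(g[k] * (x - mean) ** 2 for g, x in zip(gamma, obs)) / weight
        # expected k -> k transitions divided by expected visits to k
        kk = sum(alpha[t][k] * A[k][k] * B[t + 1][k] * beta[t + 1][k] / c[t + 1]
                 for t in range(T - 1))
        visits = sum(gamma[t][k] for t in range(T - 1)) + 1e-12
        new_stay.append(min(max(kk / visits, 1e-6), 1 - 1e-6))
        new_mu.append(mean)
        new_sd.append(max(math.sqrt(var), 1e-3))
    return loglik, (new_stay, new_mu, new_sd)

def fit(obs, restarts=10, iters=30, seed=0):
    """Random-restart EM; keeps the parameters with the highest likelihood."""
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        params = ([rng.uniform(0.5, 0.99), rng.uniform(0.5, 0.99)],
                  [rng.choice(obs), rng.choice(obs)],
                  [rng.uniform(0.5, 2.0), rng.uniform(0.5, 2.0)])
        for _ in range(iters):
            _, params = em_step(obs, *params)
        loglik, _ = em_step(obs, *params)  # likelihood of the final parameters
        if best is None or loglik > best[0]:
            best = (loglik, params)
    return best

def viterbi(obs, stay, mu, sd):
    """Most likely state path under the fitted model."""
    logA = [[math.log(stay[0]), math.log(1 - stay[0])],
            [math.log(1 - stay[1]), math.log(stay[1])]]
    score = [math.log(0.5) + log_gauss(obs[0], mu[k], sd[k]) for k in (0, 1)]
    back = []
    for x in obs[1:]:
        ptr, nxt = [], []
        for k in (0, 1):
            j = 0 if score[0] + logA[0][k] >= score[1] + logA[1][k] else 1
            ptr.append(j)
            nxt.append(score[j] + logA[j][k] + log_gauss(x, mu[k], sd[k]))
        back.append(ptr)
        score = nxt
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

# Synthetic signal (hypothetical): closed chromatin near 0, open chromatin near 5.
data_rng = random.Random(1)
truth = [0] * 60 + [1] * 30 + [0] * 60
obs = [data_rng.gauss(5.0 if s else 0.0, 1.0) for s in truth]
loglik, (stay, mu, sd) = fit(obs)
path = viterbi(obs, stay, mu, sd)
```

Because training is unsupervised, the state indices are arbitrary; in this sketch the state with the larger fitted mean would be read as open chromatin.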

7 1.5 megabases

8 A problem, and two solutions Problem: We are interested in phenomena occurring at multiple scales. Solution #1: Perform a wavelet smooth prior to HMM analysis. Solution #2: Build a more complex probability model.
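Solution #1 can be illustrated with a small smoothing routine. The slide does not say which wavelet or how many levels were used, so the Haar wavelet, the `levels` parameter, and the function name `haar_smooth` below are assumptions made for the sake of a short sketch.

```python
def haar_smooth(signal, levels):
    """Smooth by discarding the `levels` finest Haar detail bands.

    Uses the unnormalized Haar transform: each analysis step keeps pairwise
    averages and drops pairwise differences, so reconstruction with zeroed
    details simply duplicates each coefficient.
    """
    block = 2 ** levels
    if len(signal) % block != 0:
        raise ValueError("signal length must be a multiple of 2**levels")
    approx = list(signal)
    for _level in range(levels):
        # analysis: pairwise averages (detail coefficients are discarded)
        approx = [(approx[i] + approx[i + 1]) / 2 for i in range(0, len(approx), 2)]
    for _level in range(levels):
        # synthesis with zero details: each value expands to two equal samples
        approx = [value for value in approx for _copy in (0, 1)]
    return approx

smoothed_fine = haar_smooth([1, 3, 5, 7, 2, 2, 8, 0], levels=1)
smoothed_coarse = haar_smooth([1, 3, 5, 7, 2, 2, 8, 0], levels=2)
```

Raising `levels` removes finer structure, so the smoothed track handed to the HMM reflects a coarser scale.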

9

10

11

12

13 Change point model
Four-state model:
–major DNase hypersensitive site (DHS),
–minor DHS,
–intermediate sensitivity region, and
–insensitive region.
Continuous mixture of Gaussians at each state.
Gamma distribution of lengths within each region.
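To make the generative picture concrete, the following sketch samples data from a simplified version of such a model. Every number in it is hypothetical, the uniform choice of the next state is an assumption, and each state emits from a single Gaussian rather than the continuous mixture named on the slide.

```python
import random

STATES = ["insensitive", "intermediate", "minor DHS", "major DHS"]

# Hypothetical (mean, standard deviation) of the signal in each state.
EMISSION = {
    "insensitive": (0.0, 0.5),
    "intermediate": (1.0, 0.7),
    "minor DHS": (3.0, 1.0),
    "major DHS": (6.0, 1.5),
}

# Hypothetical gamma (shape, scale) for segment lengths; mean length = shape * scale.
LENGTH = {
    "insensitive": (2.0, 50.0),
    "intermediate": (2.0, 20.0),
    "minor DHS": (3.0, 3.0),
    "major DHS": (3.0, 5.0),
}

def sample_segments(n_segments, seed=0):
    """Draw a sequence of (state, length) segments and the data they emit."""
    rng = random.Random(seed)
    segments, data = [], []
    state = rng.choice(STATES)
    for _ in range(n_segments):
        shape, scale = LENGTH[state]
        length = max(1, round(rng.gammavariate(shape, scale)))
        mean, sd = EMISSION[state]
        data.extend(rng.gauss(mean, sd) for _ in range(length))
        segments.append((state, length))
        # a change point always moves to a different state (assumed uniform choice)
        state = rng.choice([s for s in STATES if s != state])
    return segments, data

segments, data = sample_segments(20)
```

Unlike the simple HMM, whose segment lengths are implicitly geometric, the explicit gamma draw lets each region type have its own characteristic length.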

14

15 Spanning the gaps Beginning in State 1 (Insensitive)

16 Spanning the gaps Beginning in State 4 (Major DHS)

17 Selecting the number of states

18 Improved fit to the data
Each panel is a QQ plot comparing the observed residuals with the theoretical Gaussian.
Panels: Insensitive; Intermediate sensitivity; Minor DHS; Major DHS

19 Capturing different scales

20 Enrichment of biologically relevant features

21 Future directions
Many types of genomic data
–Phylogenetic conservation scores
–Various histone modifications
–Replication timing, etc.
Perform segmentations in multiple dimensions simultaneously.
Assign statistical significance to observed segments.

22 Shotgun proteomics
Diagram labels: Trained Model; Test PSMs; Training PSMs; Probability Model; Evaluation
PSM = peptide-spectrum match

23 Peptide sequence influences peak height

24 Bayesian network
We model peptide fragmentation using a Bayesian network. Nodes represent random variables, and edges represent conditional dependencies. Each node stores a conditional probability table (CPT) giving Pr(node|parents).
Example nodes and their values:
–Is b-ion observed? (no b-ion observed; b-ion observed)
–b-ion intensity (intensity > 50%; intensity < 50%)
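A two-node network of this kind can be written down directly as lookup tables. The probabilities below are invented for illustration, and the edge direction (observation status as the parent of intensity) is an assumption about the figure.

```python
# Hypothetical CPT for the root node "Is b-ion observed?" (no parents).
P_OBSERVED = {True: 0.7, False: 0.3}

# Hypothetical CPT for "b-ion intensity", indexed by its parent's value.
# An unobserved ion is assumed to always fall in the "< 50%" bin.
P_INTENSITY = {
    True: {"> 50%": 0.4, "< 50%": 0.6},
    False: {"> 50%": 0.0, "< 50%": 1.0},
}

def joint(observed, intensity):
    """Pr(observed, intensity) = Pr(observed) * Pr(intensity | observed)."""
    return P_OBSERVED[observed] * P_INTENSITY[observed][intensity]

# The joint distribution factorizes over the CPTs and sums to one.
total = sum(joint(o, i) for o in (True, False) for i in ("> 50%", "< 50%"))
```

Each row of a CPT is a distribution over the node's values for one setting of its parents, which is why every row sums to one.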

25 Ion series modeled in a Markov chain
The figure repeats the node pair "Is b-ion observed?" and "b-ion intensity" five times along the chain.
Similar to PepHMM (Han et al., 2005).
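As a rough sketch of what chaining adds, the code below scores a series of b-ion observations in which each position depends on the previous one. The probabilities are hypothetical, and linking consecutive "Is b-ion observed?" nodes is an assumption about how the chain is wired; the intensity nodes are left out for brevity.

```python
import math

# Hypothetical distribution for the first position in the ion series.
P_FIRST = {True: 0.6, False: 0.4}

# Hypothetical transition CPT: Pr(observed at position i | observed at position i-1).
P_NEXT = {
    True: {True: 0.8, False: 0.2},
    False: {True: 0.3, False: 0.7},
}

def chain_log_prob(observed):
    """Log-probability of a sequence of observed/not-observed flags."""
    logp = math.log(P_FIRST[observed[0]])
    for prev, cur in zip(observed, observed[1:]):
        logp += math.log(P_NEXT[prev][cur])
    return logp

score = chain_log_prob([True, True, False, False, True])
```

Working in log space keeps the product of many small probabilities numerically stable for long peptides.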

26 A more realistic model
Nodes: Is b-ion observed?; b-ion intensity; N-term AA; C-term AA; Is ion detectable?; Fractional m/z; Is proton mobile?

27 Ion series modeled in a Markov chain

28 Vectors of log-odds ratios Correct peptide-spectrum matches Incorrect peptide-spectrum matches

29 Binary classifier

30 Model Evaluation: Accuracy
Model      Redundant TP/FP   Unique TP/FP
Bayes Net  285/300, 95%      137/144, 95.1%
SEQUEST    288/300, 96%      136/144, 94.4%
InsPecT    274/300, 91.3%    131/144, 90.9%
Diagram labels: Trained Model; Test PSMs; Training PSMs; Probability Model; Evaluation
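The percentages on this slide can be recomputed from the counts. The sketch reads each entry as correct identifications over the total number of spectra, which is an interpretation of the "TP/FP" column headers; the dictionary layout and the `percent` helper are hypothetical.

```python
# (correct, total) pairs copied from the table on this slide.
RESULTS = {
    "Bayes Net": {"redundant": (285, 300), "unique": (137, 144)},
    "SEQUEST": {"redundant": (288, 300), "unique": (136, 144)},
    "InsPecT": {"redundant": (274, 300), "unique": (131, 144)},
}

def percent(correct, total):
    """Accuracy as a percentage."""
    return 100.0 * correct / total

for model, sets in RESULTS.items():
    cells = [f"{kind}: {c}/{n} = {percent(c, n):.1f}%" for kind, (c, n) in sets.items()]
    print(model, "|", ", ".join(cells))
```

One entry differs in the last digit: 131/144 is about 90.97%, which rounds to 91.0%, so the 90.9% on the slide appears to be truncated rather than rounded.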

31 An incorrect identification
SEQUEST: LRPGAELLEGAHVGNFVEMK
Bayes net: HQDETQDALNALDLLTNEK
Blue = b and y, green = a, red = ammonia loss, magenta = water loss, sienna = +2
This peptide does not appear in E. coli, the organism from which this protein sample was derived.

32 Co-eluting peptides
SEQUEST: AFPEAVLFIHPLDAK
Bayes net: DVFVHFSALQGNQFK
Blue = b and y, green = a, red = ammonia loss, magenta = water loss, sienna = +2

33 Future directions Build a single Bayesian network that includes all ion types. Produce more descriptive outputs from the Bayesian network for input to the classifier. Add more biophysical details to the model: chromatography retention time, a better mass-to-charge estimate, etc. Generate a better (larger, more accurate) gold standard data set.

34 Acknowledgments
DNase I hypersensitivity
–John Stamatoyannopoulos
–Pete Sabo
–Scott Kuehn
–many others in the Stam lab
Wavelet analysis: Bob Thurman
Change point model
–Charles Lawrence
–Heng Lian
–William Thompson
Mass spectrometry
–Aaron Klammer
–Jeff Bilmes
–Sheila Reynolds
–Michael MacCoss

