
1 Project 4: Information discovery using Stochastic Context-Free Grammars (SCFG)
Wei Du, Ranjan Santra
May 16, 2001

2 Outline
– Goals of the project
– Quick review of background material
– Data input and parsing
– The inside algorithm
– The Cocke-Younger-Kasami (CYK) algorithm
– Implementation details and results

3 Goals for this project
– Implement a stochastic context-free grammar
– Build a user interface for easy definition of the grammar
– Read the grammar into memory and compute
  a. the probability that the specified grammar produced a sample sequence
  b. the most probable parse tree for that sequence
– Model a small sequence using a sample SCFG
– Remaining issue: parameter re-estimation

4 Quick Review
Context-free grammar: W → β, i.e. one non-terminal rewrites to any string of terminals and non-terminals.
The same CFG in Chomsky Normal Form allows only W_v → W_x W_y or W_v → a, i.e. one non-terminal rewrites to exactly two non-terminals or one terminal.
Any CFG can be put into normal form by adding additional non-terminals.
We choose normal form for computational ease.

5 Stochastic Context-Free Grammar in Chomsky Normal Form
Original grammar:
S → A B C (0.9)   A → a (0.5)   B → b (0.8)   C → c (0.6)
The same grammar in Chomsky Normal Form:
S → A D (0.9)   D → B C (1.0)   A → a (0.5)   B → b (0.8)   C → c (0.6)
All productions have associated transition probabilities. Given a grammar in this form and a sample sequence, we want to:
– compute the probability that this grammar produced the sequence (the inside algorithm)
– find the optimal parse tree through this grammar that results in the sample sequence (the CYK algorithm)
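To make the algorithms on the later slides concrete, the sketches below use a plain-Python encoding of this grammar. The table names `transitions` and `emissions` are our own illustrative choice, not the project's (which stores the grammar in linked lists, as the next slide shows):

```python
# The Chomsky-Normal-Form grammar from this slide as probability tables:
# transitions[v][(y, z)] = P(W_v -> W_y W_z), emissions[v][a] = P(W_v -> a).
transitions = {
    "S": {("A", "D"): 0.9},
    "D": {("B", "C"): 1.0},
}
emissions = {
    "A": {"a": 0.5},
    "B": {"b": 0.8},
    "C": {"c": 0.6},
}
```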

6 [Diagram: the grammar stored in linked lists. An index over the non-terminals W1 … WM points each one to its productions, pairs of non-terminals or terminal symbols (a, b, c), for easy lookup.]

7 Solving the problem
1. Users input their grammar in normal form
2. The grammar is written to a file:
SAD$0.9   (S → A D with probability 0.9)
DBC$1.0
Aa$0.5
Bb$0.8
Cc$0.6
* One line per production rule
* The grammar starts with the start symbol
* Each line denotes a transition between non-terminals or between a non-terminal and a terminal
* The probability of a transition is given after the $ symbol
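A minimal sketch of how a file in this format might be read back into the tables introduced after slide 5. It assumes exactly the format shown here, single-character symbols (uppercase non-terminals, lowercase terminals), one rule per line, probability after the $ sign; `load_grammar` is an illustrative name, not the project's actual code:

```python
def load_grammar(path):
    """Parse the grammar file into (transitions, emissions) tables.

    Assumes the format shown on this slide, e.g. "SAD$0.9" for S -> A D
    with probability 0.9, and "Aa$0.5" for A -> a with probability 0.5.
    """
    transitions, emissions = {}, {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            rule, prob = line.split("$")
            lhs, rhs = rule[0], rule[1:]
            if len(rhs) == 2:     # non-terminal pair: v -> y z
                transitions.setdefault(lhs, {})[(rhs[0], rhs[1])] = float(prob)
            else:                 # single terminal: v -> a
                emissions.setdefault(lhs, {})[rhs] = float(prob)
    return transitions, emissions
```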

8 The inside algorithm
Purpose: compute α(i, j, v), the probability of a subtree rooted at the non-terminal W_v deriving the subword (x_i … x_j) of the sequence (x_1 … x_L), given the grammar G:
α(i, j, v) = P(W_v ⇒* x_i … x_j | G)
It is computed recursively, from the bottom up, starting with subwords of length one.

9 [Diagram: a parse subtree over positions 1 … i … k, k+1 … j … L, with W_v at the root splitting into W_y over (x_i … x_k) and W_z over (x_{k+1} … x_j).]
Initialisation: for i = 1 to L, v = 1 to M:
α(i, i, v) = e_v(x_i)
Iteration (over subwords of increasing length): for i = 1 to L−1, j = i+1 to L, v = 1 to M:
α(i, j, v) = Σ_{y=1..M} Σ_{z=1..M} Σ_{k=i..j−1} α(i, k, y) α(k+1, j, z) t_v(y, z)
where t_v(y, z) is the probability of the production W_v → W_y W_z and e_v(a) is the probability of W_v → a.
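The recursion on this slide translates nearly line for line into code. The sketch below is our own transcription, not the project's implementation; it uses the `transitions`/`emissions` tables from the earlier sketches and 0-based, inclusive indices:

```python
def inside(x, transitions, emissions, start):
    """Inside algorithm for a CNF SCFG.

    alpha[(i, j, v)] = P(W_v =>* x[i..j] | G), 0-based inclusive indices.
    Returns (P(x | G), alpha).
    """
    L = len(x)
    nonterms = set(transitions) | set(emissions)
    alpha = {}
    # Initialisation: subwords of length one, alpha(i, i, v) = e_v(x_i).
    for i in range(L):
        for v in nonterms:
            alpha[(i, i, v)] = emissions.get(v, {}).get(x[i], 0.0)
    # Iteration: grow the subword, summing over rules v -> y z and splits k.
    for span in range(2, L + 1):
        for i in range(L - span + 1):
            j = i + span - 1
            for v in nonterms:
                total = 0.0
                for (y, z), p in transitions.get(v, {}).items():
                    for k in range(i, j):
                        total += alpha[(i, k, y)] * alpha[(k + 1, j, z)] * p
                alpha[(i, j, v)] = total
    # Termination: the start symbol must derive the whole sequence.
    return alpha[(0, L - 1, start)], alpha
```

With the slide-5 grammar, `p, alpha = inside("abc", transitions, emissions, "S")` gives p = 0.9 · 0.5 · 1.0 · 0.8 · 0.6 = 0.216.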

10 References
1. [Brown and Wilson, 1995] Brown, M.P.S. and Wilson, C. RNA pseudoknot modeling using intersections of stochastic context-free grammars with applications to database search. In Hunter, L. and Klein, T., editors, Pacific Symposium on Biocomputing, pages 109-125.
2. [Brown, 1999] Brown, M.P.S. RNA modeling using stochastic context-free grammars. Ph.D. thesis.
3. [Eddy and Durbin, 1994] Eddy, S.R. and Durbin, R. RNA sequence analysis using covariance models. NAR, 22:2079-2088.
4. [Krogh et al., 1994] Krogh, A., Brown, M., Mian, I.S., Sjolander, K., and Haussler, D. Hidden Markov models in computational biology: applications to protein modeling. JMB, 235:1501-1531.
5. [Lowe and Eddy, 1999] Lowe, T. and Eddy, S. A computational screen for methylation guide snoRNAs in yeast. Science, 283:1168-1171.
6. [Sakakibara et al., 1994] Sakakibara, Y., Brown, M., Hughey, R., Mian, I.S., Sjolander, K., Underwood, R.C., and Haussler, D. Stochastic context-free grammars for tRNA modeling. NAR, 22:5112-5120.
7. [Underwood, 1994] Underwood, R.C. Stochastic context-free grammars for modeling three spliceosomal small nuclear ribonucleic acids. Master's thesis, University of California, Santa Cruz.

11 END

12 Outside Algorithm
The outside probability β(i, j, v) is the probability that, starting from the start non-terminal S, the non-terminal W_v is generated together with the string not dominated by it: (x_1 … x_{i−1}) to the left and (x_{j+1} … x_L) to the right:
β(i, j, v) = P(S ⇒* x_1 … x_{i−1} W_v x_{j+1} … x_L | G)
The outside variable can be computed recursively, starting with the largest excluded subsequence:
β(i, j, v) = Σ_y Σ_z Σ_{k=1..i−1} α(k, i−1, z) β(k, j, y) t_y(z, v) + Σ_y Σ_z Σ_{k=j+1..L} α(j+1, k, z) β(i, k, y) t_y(v, z)
The probability that the non-terminal W_v derives the subword (x_i … x_j) is then given by α(i, j, v) β(i, j, v) / P(x | G).
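A matching transcription of the outside recursion, again a sketch rather than the project's code; it consumes the α table returned by the `inside()` sketch above, with the same index conventions:

```python
def outside(x, transitions, start, alpha):
    """Outside algorithm: beta[(i, j, v)] = P(S =>* x[0..i-1] W_v x[j+1..] | G).

    Takes the alpha table returned by inside(); 0-based inclusive indices.
    """
    L = len(x)
    nonterms = {v for (_, _, v) in alpha}
    beta = {(i, j, v): 0.0 for i in range(L) for j in range(i, L) for v in nonterms}
    beta[(0, L - 1, start)] = 1.0       # the full subword excludes nothing
    for span in range(L - 1, 0, -1):    # shrink the subword (i, j)
        for i in range(L - span + 1):
            j = i + span - 1
            for v in nonterms:
                total = 0.0
                for y in transitions:
                    for (a, b), p in transitions[y].items():
                        if b == v:      # v is the right child of W_y -> W_a W_v
                            for k in range(0, i):
                                total += alpha[(k, i - 1, a)] * beta[(k, j, y)] * p
                        if a == v:      # v is the left child of W_y -> W_v W_b
                            for k in range(j + 1, L):
                                total += alpha[(j + 1, k, b)] * beta[(i, k, y)] * p
                beta[(i, j, v)] = total
    return beta
```

Dividing α(i, j, v) · β(i, j, v) by P(x | G), i.e. `alpha[(i, j, v)] * beta[(i, j, v)] / p`, gives the posterior from the last line of the slide.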

13 [Diagram repeated from slide 6: the grammar stored in linked lists, with an index over the non-terminals for easy lookup of productions.]

14 Why SCFG?
SCFGs are more powerful because they can model:
– evolutionary processes of mutation, insertion, and deletion
– interactions between base pairs
[Figure: RNA secondary structure showing base-pair interactions; sequence fragments GACGCAAGUC and UCGGAAACGA with surrounding loop bases.]

15 Some Applications of SCFG modeling
tRNA
– [Sakakibara et al., 1994]
– [Eddy and Durbin, 1994]
snRNAs
– [Underwood, 1994]
A pseudoknotted biotin binder
– [Brown and Wilson, 1995]
snoRNA
– [Lowe and Eddy, 1999]
Small subunit ribosomal RNA
– [Brown, 1999]

16 Working Procedure (steps 1 and 2, entering the grammar and writing it to file, are on slide 7)
3. Generate e and t:
e: the probabilities of rules like W → a
t: the probabilities of rules like W → X Y
4. Compute α(i, j, v), the probability of a subtree rooted at the non-terminal W_v deriving the subword (x_i … x_j):
α(i, j, v) = P(W_v ⇒* x_i … x_j | G)
computed recursively, bottom up, starting with subwords of length one:
α(i, j, v) = Σ_{y=1..M} Σ_{z=1..M} Σ_{k=i..j−1} α(i, k, y) α(k+1, j, z) t_v(y, z)
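Step 4 yields P(x | G); the most probable parse tree asked for in the project goals comes from the CYK algorithm, which is the same recursion with the sums replaced by a max and traceback pointers recorded. A sketch in the same conventions as the earlier code, done in log space to avoid underflow; it assumes the grammar can actually derive x:

```python
import math

def cyk(x, transitions, emissions, start):
    """CYK for a CNF SCFG: inside() with max instead of sum, in log space.

    Returns (log-probability of the best parse, parse tree). Tree nodes are
    (v, i, j) for leaves and (v, i, j, left, right) for internal nodes.
    Assumes the grammar can derive x (otherwise traceback hits a dead end).
    """
    L = len(x)
    nonterms = set(transitions) | set(emissions)
    gamma, trace = {}, {}
    for i in range(L):
        for v in nonterms:
            p = emissions.get(v, {}).get(x[i], 0.0)
            gamma[(i, i, v)] = math.log(p) if p > 0 else -math.inf
    for span in range(2, L + 1):
        for i in range(L - span + 1):
            j = i + span - 1
            for v in nonterms:
                best, arg = -math.inf, None
                for (y, z), p in transitions.get(v, {}).items():
                    if p <= 0:
                        continue
                    for k in range(i, j):
                        s = gamma[(i, k, y)] + gamma[(k + 1, j, z)] + math.log(p)
                        if s > best:
                            best, arg = s, (y, z, k)
                gamma[(i, j, v)], trace[(i, j, v)] = best, arg

    def build(i, j, v):                 # follow the traceback pointers
        if i == j:
            return (v, i, j)
        y, z, k = trace[(i, j, v)]
        return (v, i, j, build(i, k, y), build(k + 1, j, z))

    return gamma[(0, L - 1, start)], build(0, L - 1, start)
```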

