Noncoding RNA Genes Pt. 2 SCFGs CS374 Vincent Dorie.


1 Noncoding RNA Genes Pt. 2 SCFGs CS374 Vincent Dorie

2 Motivation Noncoding RNA genes can be anywhere Noncoding RNA genes can do anything

3 Location rRNA, snRNA Exons? Introns Viral vectors

4 Function

5 Function, pt. 2

6 Overview “RSEARCH: Finding homologs of single structured RNA sequences” by Klein and Eddy (2003) “Pairwise RNA Structure Comparison with Stochastic Context-Free Grammars” by Holmes and Rubin (2002)

7 Comparison - Methodology RSEARCH DART (Stemloc) Sequence

8 Comparison, Pt. 2 - Uses RSEARCH Find parts of a genome which may be homologous to query sequence More practical in comparative genomics DART (Stemloc) Investigate a specific sequence suspected of being homologous to query sequence

9 Comparison, Pt. 3 - Complexity RSEARCH $O((M - B)LD + BLD^2)$ to scan, $O(M^4)$ to calculate statistics DART (Stemloc) Between $O(LM)$ and $O(L^3 M^3)$

10 Background: Context Free Grammars Four-tuple {N, T, S, P} N is a set of nonterminals T is a set of terminals S is the start symbol, S ∈ N P is a set of productions

11 Context Free Grammars, pt. 2 Sample Grammar N = {S, A, B} T = {a, u, c, g, ε} P = { S -> A | B, A -> aAc | aBc | g, B -> g }

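12 Context Free Grammars, pt. 3 Parse Trees Parse: aagcc [Figure: two parse trees for aagcc. In the first, S -> A -> aAc -> aaAcc -> aagcc. In the second, the innermost g comes from B: S -> A -> aAc -> aaBcc -> aagcc.]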
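
The two parse trees for aagcc can be checked by brute force. The following is a minimal sketch, not anything from the slides beyond the sample grammar itself: the function name `parses` and the tuple representation of a tree are illustrative choices, and the search relies on every production in this grammar having at most one nonterminal.

```python
# Sample grammar from the slide: uppercase = nonterminal, lowercase = terminal.
GRAMMAR = {
    "S": ["A", "B"],
    "A": ["aAc", "aBc", "g"],
    "B": ["g"],
}


def parses(symbol, s):
    """Return every parse tree of string s rooted at the nonterminal symbol.

    A tree is a tuple (symbol, production, child_trees). This brute-force
    search assumes each production holds at most one nonterminal.
    """
    trees = []
    for rhs in GRAMMAR[symbol]:
        positions = [i for i, ch in enumerate(rhs) if ch.isupper()]
        if not positions:
            # Purely terminal production: it must spell s exactly.
            if rhs == s:
                trees.append((symbol, rhs, []))
            continue
        k = positions[0]
        prefix, suffix = rhs[:k], rhs[k + 1:]
        # The terminals around the nonterminal must match both ends of s.
        if len(s) < len(prefix) + len(suffix):
            continue
        if not (s.startswith(prefix) and s.endswith(suffix)):
            continue
        inner = s[len(prefix):len(s) - len(suffix)]
        for child in parses(rhs[k], inner):
            trees.append((symbol, rhs, [child]))
    return trees


for tree in parses("S", "aagcc"):
    print(tree)
```

The grammar is ambiguous: the same string has more than one parse, which is why the stochastic version needs a probability per parse.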

13 Stochastic CFG Each production associated with a probability Probabilities for all productions starting from a given nonterminal sum to one Superset of HMM Assigns a probability to a parse E.g. S -> A, 0.3 | B, 0.7
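
As a rough illustration of how an SCFG assigns a probability to a parse, the sketch below multiplies the probabilities of the productions a parse uses. Only the S -> A (0.3) and S -> B (0.7) values come from the slide. The probabilities for A are made up for the example, and B -> g is 1.0 because it is the only production for B.

```python
from math import isclose, prod

PROBS = {
    ("S", "A"): 0.3, ("S", "B"): 0.7,                        # from the slide
    ("A", "aAc"): 0.5, ("A", "aBc"): 0.3, ("A", "g"): 0.2,   # assumed values
    ("B", "g"): 1.0,                                         # only production for B
}


def is_normalized(probs):
    """Check that the productions of each nonterminal sum to one."""
    totals = {}
    for (lhs, _rhs), p in probs.items():
        totals[lhs] = totals.get(lhs, 0.0) + p
    return all(isclose(t, 1.0) for t in totals.values())


def parse_probability(productions):
    """Probability of a parse = product of the productions it uses."""
    return prod(PROBS[p] for p in productions)


# The two parses of "aagcc" under the sample grammar.
PARSE_VIA_A = [("S", "A"), ("A", "aAc"), ("A", "aAc"), ("A", "g")]
PARSE_VIA_B = [("S", "A"), ("A", "aAc"), ("A", "aBc"), ("B", "g")]

print(is_normalized(PROBS))
print(parse_probability(PARSE_VIA_A), parse_probability(PARSE_VIA_B))
```

Summing the two parse probabilities gives the total probability of the string under these assumed parameters.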

14 Pairwise (profile) SCFG Terminals in each production can exist in each of two strings E.g. W -> x_i y_k V x_j y_l

15 RSEARCH: pSCFG Simplified Each secondary structure specifies (most of) a grammar, creating a “Model Architecture” Eschews probabilistic interpretation Problem becomes fitting target to model architecture Sequence

16

17 Node Types vs. Node States Node types are what we want to do given the model (e.g. MATP is match pair) Node state represents what happens when scanning a target sequence E.g. Node type is MATP, target sequence doesn’t have a pair in that location -> insert a gap

18 Node States Set of node states possible for node type

19 Gap Classes Gap class per node type/state pair

20 Transition Scores Gap class determines transition scores Gap penalties are affine
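
An affine gap penalty charges one cost to open a gap and a smaller cost for each further position. The sketch below assumes the common convention that the open penalty covers the first gapped position. The slides do not say which convention RSEARCH uses, and the numeric values are hypothetical.

```python
def affine_gap_score(length, gap_open, gap_extend):
    """Score (negative) of a gap of the given length under an affine penalty.

    Assumed convention: the first gapped position costs gap_open and every
    additional position costs gap_extend.
    """
    if length <= 0:
        return 0.0
    return -(gap_open + (length - 1) * gap_extend)


# Hypothetical penalties: one long gap is cheaper than several short ones.
print(affine_gap_score(4, gap_open=10, gap_extend=2))      # one gap of length 4
print(4 * affine_gap_score(1, gap_open=10, gap_extend=2))  # four gaps of length 1
```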

21 Emission Scores Emission scores determined empirically

22

23 Parameterizing the Model Emission Scores Substitution Matrices Scores are observed / random
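
One common reading of "observed / random" is a log-odds score: the log of the observed frequency of an aligned pair over the frequency expected if the two residues were independent. The sketch below assumes that reading, base-2 logarithms, and made-up frequencies. It is not the RSEARCH implementation.

```python
import math


def substitution_scores(pair_freq, background):
    """Log-odds score for each aligned pair (a, b).

    score(a, b) = log2( observed_freq(a, b) / (background[a] * background[b]) )
    """
    return {
        (a, b): math.log2(f / (background[a] * background[b]))
        for (a, b), f in pair_freq.items()
    }


# Hypothetical numbers: uniform background, two observed pair frequencies.
background = {"a": 0.25, "c": 0.25, "g": 0.25, "u": 0.25}
pair_freq = {("a", "a"): 0.25, ("a", "g"): 0.03125}
print(substitution_scores(pair_freq, background))
```

A positive score means the pair is seen more often than chance predicts, and a negative score means less often.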

24 RIBOSUM Matrices Start with MSA Whose MSA? RIBOSUM[X, Y] Sequences X% identical are reweighted to sum to 1 Only sequences Y% identical are counted in making matrices

25 Model Parameters Gap open penalty (single and pair) Gap extension penalty (single and pair) Internal start penalty Internal end penalty

26 Solution Guess and check “We might have been able to derive a more robust parameter set had we used a more comprehensive set of tests, but the long running time required by RSEARCH makes such an approach infeasible.”

27 Digression: Biostatistics Confidence intervals Expectation values

28 Gumbel Distribution Parameterized by λ and K $E = K N e^{-\lambda x}$, $P = 1 - e^{-E}$
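
The two formulas translate directly into code. This sketch assumes the standard extreme-value form $E = K N e^{-\lambda x}$ with $x$ the hit score and $N$ the size of the search space. The parameter values in the example are placeholders, not fitted values from RSEARCH.

```python
import math


def e_value(score, K, lam, N):
    """Expected number of hits scoring at least `score`: E = K * N * exp(-lam * score)."""
    return K * N * math.exp(-lam * score)


def p_value(e):
    """Probability of at least one such hit: P = 1 - exp(-E)."""
    # -expm1(-e) equals 1 - exp(-e) but stays accurate for very small E.
    return -math.expm1(-e)


# Placeholder parameters for illustration only.
e = e_value(score=30.0, K=0.1, lam=0.5, N=2.1e7)
print(e, p_value(e))
```

For small E the two values are nearly equal, which is why E-values and P-values are often reported side by side.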

29 Gumbel Distribution, pt. 2 K and λ depend on G+C content of target database For database with heterogeneous G+C content, compute K and λ for G+C bins

30 Putting it All Together Run against database substrings of length two times the query Greedily take K best, non-overlapping hits Recover alignments Report: score, position in database, alignment, E-value, P-value Statistics need to be calculated for every query and target database
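
The greedy selection step can be sketched as follows. The `(score, start, end)` tuple layout, the inclusive coordinates, and the function name are assumptions for illustration. Here `k` is the number of hits to keep, which is how the slide's "K best" is read in this sketch.

```python
def greedy_hits(hits, k):
    """Pick up to k highest-scoring hits that do not overlap each other.

    hits: iterable of (score, start, end) with inclusive coordinates.
    """
    chosen = []
    for score, start, end in sorted(hits, key=lambda h: h[0], reverse=True):
        if len(chosen) == k:
            break
        # Keep the hit only if it is disjoint from every hit already chosen.
        if all(end < s or start > e for _score, s, e in chosen):
            chosen.append((score, start, end))
    return chosen


hits = [(50, 0, 99), (45, 50, 149), (30, 200, 299), (10, 300, 399)]
print(greedy_hits(hits, 2))
```

The second-best hit is dropped because it overlaps the best one, so the third-best takes its place.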

31 Time For a 113 nt sequence against $2.1 \times 10^7$ nt database, 2.9 CPU days. 2% computing statistics For a 330 nt sequence against $2.1 \times 10^7$ nt database, 38 CPU days. 7% computing statistics Parallelized to 33 minutes and 7.4 hours respectively

32 Shifting Gears Fold Envelopes Pre-enumerates the pSCFG search space Presents conditional versions of dynamic programming algorithms User defined complexity

33 Fold Envelopes, pt. 2 Conceptualize search over grammars and parse trees Each node in tree accounts for subsequence [Figure: a parse-tree node W_u accounts for the inside sequence X_i..j; the rest of the tree accounts for the outside sequence, X_0..i and X_j..L]

34 Analogy: Message Passing Inside algorithm: likelihood of sequence over all possible parses Cocke-Younger-Kasami algorithm: maximum likelihood parse of a sequence Inside-Outside algorithm: expected number of times each grammar production is used Use fold envelopes to limit messages by restricting subsequences considered
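
To give a feel for how restricting subsequences shrinks the search, the sketch below builds the simplest envelope imaginable: every subsequence (i, j) no longer than a chosen span. This banded rule is a stand-in chosen for the example, not a construction described in the slides, and the function names are hypothetical.

```python
def full_envelope(length):
    """Every subsequence (i, j) with 0 <= i <= j <= length."""
    return {(i, j) for i in range(length + 1) for j in range(i, length + 1)}


def banded_envelope(length, max_span):
    """Only subsequences with j - i <= max_span (an assumed, simplistic rule)."""
    return {
        (i, j)
        for i in range(length + 1)
        for j in range(i, min(length, i + max_span) + 1)
    }


L = 100
print(len(full_envelope(L)), len(banded_envelope(L, 20)))
```

A dynamic programming pass that visits only the cells in the envelope does proportionally less work, which is where the user-defined trade-off between speed and search depth comes from.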

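35 The Inside Algorithm To compute $a(i, j, V) = P(x_i \ldots x_j \text{ produced by } V)$: $a(i, j, V) = \sum_X \sum_Y \sum_k a(i, k, X)\, a(k+1, j, Y)\, P(V \to XY)$ [Figure: V spans x_i..x_j, with X covering x_i..x_k and Y covering x_k+1..x_j] (Batzoglou)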
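
The recursion $a(i, j, V) = \sum_X \sum_Y \sum_k a(i, k, X)\, a(k+1, j, Y)\, P(V \to XY)$ can be sketched for a grammar in Chomsky normal form. The slide shows only the binary-rule step. The base case over single-terminal rules, the dictionary layout, and the toy grammar are assumptions added so the example runs.

```python
from collections import defaultdict


def inside(seq, binary_rules, terminal_rules):
    """Inside probabilities for an SCFG in Chomsky normal form.

    binary_rules:   {(V, X, Y): P(V -> X Y)}
    terminal_rules: {(V, t): P(V -> t)}
    Returns a with a[(i, j, V)] = P(V derives seq[i..j]), 0-based, inclusive.
    """
    n = len(seq)
    a = defaultdict(float)
    # Base case (assumed): length-1 subsequences come from terminal rules.
    for i, ch in enumerate(seq):
        for (v, t), p in terminal_rules.items():
            if t == ch:
                a[(i, i, v)] += p
    # Recursion from the slide: split x_i..x_j at every k with i <= k < j.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for (v, x, y), p in binary_rules.items():
                total = 0.0
                for k in range(i, j):
                    total += a[(i, k, x)] * a[(k + 1, j, y)] * p
                a[(i, j, v)] += total
    return a


# Toy grammar (hypothetical): S -> S S with 0.4, S -> 'a' with 0.6.
table = inside("aaa", {("S", "S", "S"): 0.4}, {("S", "a"): 0.6})
print(table[(0, 2, "S")])
```

For "aaa" the toy grammar has two parse trees, and the value for the full span is the sum of their probabilities.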

36 Constructing Fold Envelopes Constrain to possible secondary structures Constrain to primary sequence alignment

37 Summary RSEARCH to find a set of possible homologs, sorted by score and statistics Fold Envelopes permit greater search depth in case of unfolded comparisons RSEARCH employs simplified pSCFGs Fold Envelopes are useful over full spectrum of comparisons but represent more computationally complex situations

