Noncoding RNA Genes Pt. 2 SCFGs CS374 Vincent Dorie.

Motivation Noncoding RNA genes can be anywhere Noncoding RNA genes can do anything

Location rRNA, snRNA Exons? Introns Viral vectors

Function

Function, pt. 2

Overview “RSEARCH: Finding homologs of single structured RNA sequences” by Klein and Eddy (2003) “Pairwise RNA Structure Comparison with Stochastic Context-Free Grammars” by Holmes and Rubin (2002)

Comparison - Methodology RSEARCH DART (Stemloc) Sequence

Comparison, Pt. 2 - Uses RSEARCH Find parts of a genome which may be homologous to query sequence More practical in comparative genomics DART (Stemloc) Investigate a specific sequence suspected of being homologous to query sequence

Comparison, Pt. 3 - Complexity RSEARCH: $O((M - B)LD + BLD^2)$ to scan, $O(M^4)$ to calculate statistics. DART (Stemloc): between $O(LM)$ and $O(L^3 M^3)$

Background: Context Free Grammars Four-tuple {N, T, S, P}: N is a set of nonterminals; T is a set of terminals; S is the start symbol, S ∈ N; P is a set of productions

Context Free Grammars, pt. 2 Sample Grammar: N = {S, A, B}; T = {a, u, c, g, ε}; P = { S -> A | B, A -> aAc | aBc | g, B -> g }

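Context Free Grammars, pt. 3 Parse Trees Parse: aagcc [Figure: two parse trees for aagcc under the sample grammar. First tree: S -> A -> aAc -> aaAcc -> aagcc. Second tree: S -> A -> aAc -> aaBcc -> aagcc]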
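The ambiguity on this slide (one string, more than one parse tree) can be checked mechanically. The following is a minimal sketch, not taken from the original slides: it encodes the sample grammar from the previous slide and counts distinct parse trees for a string. The names `GRAMMAR` and `count_parses` are illustrative, and the code assumes the grammar has no ε productions, which holds for the productions listed.

```python
from functools import lru_cache

# Productions of the sample grammar; each right-hand side is a tuple of symbols.
GRAMMAR = {
    "S": [("A",), ("B",)],
    "A": [("a", "A", "c"), ("a", "B", "c"), ("g",)],
    "B": [("g",)],
}


def count_parses(symbol, text):
    """Count the distinct parse trees that derive `text` from `symbol`."""

    @lru_cache(maxsize=None)
    def derive(sym, i, j):
        # A terminal matches exactly one character of the text.
        if sym not in GRAMMAR:
            return 1 if j - i == 1 and text[i] == sym else 0
        # A nonterminal sums over all of its productions.
        return sum(match(rhs, i, j) for rhs in GRAMMAR[sym])

    @lru_cache(maxsize=None)
    def match(rhs, i, j):
        # Number of ways the symbol sequence `rhs` derives text[i:j].
        if not rhs:
            return 1 if i == j else 0
        head, rest = rhs[0], rhs[1:]
        # Assumption: every symbol derives at least one character,
        # so the head ends at some k with room left for the rest.
        return sum(
            derive(head, i, k) * match(rest, k, j)
            for k in range(i + 1, j - len(rest) + 1)
        )

    return derive(symbol, 0, len(text))


if __name__ == "__main__":
    print(count_parses("S", "aagcc"))
```

Running it on `aagcc` counts the two trees shown on the slide: the innermost `g` can come from `A -> g` or from `B -> g`.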

Stochastic CFG Each production associated with a probability Probabilities for all productions starting from a given nonterminal sum to one Superset of HMM Assigns a probability to a parse E.g. S -> A, 0.3 | B, 0.7
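To make "assigns a probability to a parse" concrete, here is a small sketch that multiplies the probabilities of the productions used in a parse. Only the probabilities for `S -> A` (0.3) and `S -> B` (0.7) come from the slide; the values for the `A` and `B` productions are assumptions chosen so that each nonterminal's productions sum to one. `RULE_PROB`, `check_normalized`, and `parse_probability` are illustrative names.

```python
import math

# (left-hand side, right-hand side) -> probability.
RULE_PROB = {
    ("S", "A"): 0.3,    # from the slide
    ("S", "B"): 0.7,    # from the slide
    ("A", "aAc"): 0.5,  # assumed for illustration
    ("A", "aBc"): 0.2,  # assumed for illustration
    ("A", "g"): 0.3,    # assumed for illustration
    ("B", "g"): 1.0,    # forced: B has a single production
}


def check_normalized(rule_prob):
    """Return True if the productions of every nonterminal sum to one."""
    totals = {}
    for (lhs, _), p in rule_prob.items():
        totals[lhs] = totals.get(lhs, 0.0) + p
    return all(math.isclose(t, 1.0) for t in totals.values())


def parse_probability(rules_used):
    """Probability of a parse: the product of its production probabilities."""
    return math.prod(RULE_PROB[rule] for rule in rules_used)


# The two parses of "aagcc" under the sample grammar.
PARSE_VIA_A = [("S", "A"), ("A", "aAc"), ("A", "aAc"), ("A", "g")]
PARSE_VIA_B = [("S", "A"), ("A", "aAc"), ("A", "aBc"), ("B", "g")]

if __name__ == "__main__":
    print(check_normalized(RULE_PROB))
    print(parse_probability(PARSE_VIA_A), parse_probability(PARSE_VIA_B))
```

Under these assumed values the two parses of `aagcc` get different probabilities, which is what lets an SCFG rank alternative structures for the same sequence.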

Pairwise (profile) SCFG Terminals in each production can exist in each of two strings. E.g. $W \to x_i\, y_k\, V\, x_j\, y_l$

RSEARCH: pSCFG Simplified Each secondary structure specifies (most of) a grammar, creating a “Model Architecture” Eschews probabilistic interpretation Problem becomes fitting target to model architecture Sequence

Node Types vs. Node States Node types are what we want to do given the model (e.g. MATP is match pair). Node state represents what happens when scanning a target sequence. E.g. node type is MATP, target sequence doesn’t have a pair in that location -> insert a gap

Node States Set of node states possible for node type

Gap Classes Gap class per node type/state pair

Transition Scores Gap class determines transition scores Gap penalties are affine
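An affine gap penalty charges one cost to open a gap and a smaller cost for each further position. The sketch below shows the arithmetic only. It assumes the common convention that the open penalty covers the first gapped position, and the numeric values are placeholders rather than RSEARCH's parameters. `affine_gap_score` is an illustrative name.

```python
def affine_gap_score(length, gap_open, gap_extend):
    """Score of a gap of `length` positions under an affine penalty.

    Assumed convention: `gap_open` pays for the first gapped position,
    and each additional position adds `gap_extend`.
    """
    if length < 0:
        raise ValueError("gap length must be non-negative")
    if length == 0:
        return 0
    return gap_open + (length - 1) * gap_extend


if __name__ == "__main__":
    # Placeholder penalties: opening costs -10, each extension costs -2.
    for n in range(5):
        print(n, affine_gap_score(n, gap_open=-10, gap_extend=-2))
```

With these placeholder values one gap of length four scores -16, while four separate gaps of length one would score -40, so a single long gap is preferred over many short ones.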

Emission Scores Emission scores determined empirically

Parameterizing the Model Emission Scores: Substitution Matrices. Scores are log-odds ratios: observed frequency / random (background) frequency
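A log-odds substitution score compares how often two residues are seen aligned with how often they would align by chance. The following is a simplified sketch with toy counts. It is not the RIBOSUM construction, which also reweights sequences and scores base pairs as units. `log_odds_matrix` and the example pairs are illustrative.

```python
import math
from collections import Counter


def log_odds_matrix(aligned_pairs):
    """Log2 odds score for each ordered residue pair seen in `aligned_pairs`.

    score(x, y) = log2( observed(x, y) / (background(x) * background(y)) )
    """
    pair_counts = Counter(aligned_pairs)
    total_pairs = sum(pair_counts.values())

    # Background frequency of each residue, pooled over both aligned positions.
    residue_counts = Counter()
    for (x, y), n in pair_counts.items():
        residue_counts[x] += n
        residue_counts[y] += n
    total_residues = 2 * total_pairs

    scores = {}
    for (x, y), n in pair_counts.items():
        observed = n / total_pairs
        expected = (residue_counts[x] / total_residues) * (
            residue_counts[y] / total_residues
        )
        scores[(x, y)] = math.log2(observed / expected)
    return scores


if __name__ == "__main__":
    # Toy data: matches are common, the a/g mismatch is rare.
    toy_pairs = [("a", "a")] * 3 + [("a", "g")] + [("g", "g")] * 4
    for pair, score in sorted(log_odds_matrix(toy_pairs).items()):
        print(pair, round(score, 3))
```

Pairs seen more often than chance get positive scores and pairs seen less often get negative scores, which is the behavior a substitution matrix needs.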

RIBOSUM Matrices Start with MSA. Whose MSA? RIBOSUM[X, Y]: sequences X% identical are reweighted to sum to 1; only sequences Y% identical are counted in making matrices

Model Parameters Gap open penalty (single and pair) Gap extension penalty (single and pair) Internal start penalty Internal end penalty

Solution Guess and check “We might have been able to derive a more robust parameter set had we used a more comprehensive set of tests, but the long running time required by RSEARCH makes such an approach infeasible.”

Digression: Biostatistics Confidence intervals Expectation values

Gumbel Distribution Parameterized by $\lambda$ and $K$: $E = K N e^{-\lambda x}$, $P = 1 - e^{-E}$
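The two formulas on this slide, E = K N e^(-λx) and P = 1 - e^(-E), translate directly into code. This is a minimal sketch: the parameter values in the example are placeholders, not fitted values from RSEARCH, and N is taken to be the size of the search space as the slide's symbol suggests. `e_value` and `p_value` are illustrative names.

```python
import math


def e_value(score, lam, k, n):
    """Expected number of chance hits scoring at least `score`: K * N * exp(-lambda * x)."""
    return k * n * math.exp(-lam * score)


def p_value(e):
    """Probability of at least one chance hit, given expectation `e`: 1 - exp(-E)."""
    # expm1 keeps precision when e is very small.
    return -math.expm1(-e)


if __name__ == "__main__":
    # Placeholder parameters for illustration only.
    lam, k, n = 0.7, 0.1, 2.1e7
    for score in (20.0, 30.0, 40.0):
        e = e_value(score, lam, k, n)
        print(score, e, p_value(e))
```

For small E the P-value is nearly equal to E, which is why the two are often quoted interchangeably for strong hits.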

Gumbel Distribution, pt. 2 $K$ and $\lambda$ depend on G+C content of target database. For a database with heterogeneous G+C content, compute $K$ and $\lambda$ for G+C bins

Putting it All Together Run against database substrings of length two times the query Greedily take K best, non-overlapping hits Recover alignments Report: score, position in database, alignment, E-value, P-value Statistics need to be calculated for every query and target database
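The "greedily take K best, non-overlapping hits" step can be sketched as a sort followed by an overlap check. This is an illustration under assumptions: hits are represented as `(score, start, end)` tuples with `end` exclusive, which is a choice made here and not RSEARCH's internal format. `greedy_nonoverlapping` is an illustrative name.

```python
def greedy_nonoverlapping(hits, k):
    """Pick up to `k` highest-scoring hits whose intervals do not overlap.

    `hits` is a list of (score, start, end) tuples; `end` is exclusive.
    """
    if k <= 0:
        return []
    chosen = []
    # Visit hits from best to worst score.
    for hit in sorted(hits, key=lambda h: h[0], reverse=True):
        _, start, end = hit
        # Keep the hit only if it is disjoint from every hit already chosen.
        if all(end <= s or start >= e for _, s, e in chosen):
            chosen.append(hit)
            if len(chosen) == k:
                break
    return chosen


if __name__ == "__main__":
    hits = [(50, 0, 10), (45, 5, 15), (40, 10, 20), (10, 30, 40)]
    print(greedy_nonoverlapping(hits, 2))
```

In the example the second-best hit is skipped because it overlaps the best one, so the third-best hit is reported instead.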

Time For a 113 nt sequence against a $2.1 \times 10^7$ nt database, 2.9 CPU days; 2% computing statistics. For a 330 nt sequence against a $2.1 \times 10^7$ nt database, 38 CPU days; 7% computing statistics. Parallelized to 33 minutes and 7.4 hours respectively

Shifting Gears Fold Envelopes Pre-enumerates the pSCFG search space. Presents conditional versions of dynamic programming algorithms. User-defined complexity

Fold Envelopes, pt. 2 Conceptualize search over grammars and parse trees. Each node in the tree accounts for a subsequence: the inside sequence of node $W_u$ accounts for $X_{i..j}$; the outside sequence accounts for $X_{0..i}$ and $X_{j..L}$

Analogy: Message Passing Inside algorithm: likelihood of a sequence over all possible parses. Cocke-Younger-Kasami algorithm: maximum likelihood parse of a sequence. Inside-Outside algorithm: expected number of times each grammar production is used. Use fold envelopes to limit messages by restricting the subsequences considered

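The Inside Algorithm To compute $a(i, j, V) = P(x_i \ldots x_j \text{ produced by } V)$: $a(i, j, V) = \sum_X \sum_Y \sum_k a(i, k, X)\, a(k+1, j, Y)\, P(V \to XY)$ [Figure: V spans $x_i \ldots x_j$ and splits into X over $i..k$ and Y over $k+1..j$. Credit: Batzoglou]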
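The recursion above can be written as a short dynamic program. The sketch below assumes a grammar in Chomsky normal form (binary rules V -> X Y and single-terminal emission rules), which is the form the recursion implies. The toy grammar in the example is made up for illustration, and `inside` is an illustrative name.

```python
def inside(seq, binary_rules, emission_rules, start="S"):
    """Inside probability of `seq` under an SCFG in Chomsky normal form.

    binary_rules:   {(V, X, Y): P(V -> X Y)}
    emission_rules: {(V, terminal): P(V -> terminal)}
    a[(i, j, V)] is the probability that V produces seq[i..j], inclusive.
    """
    n = len(seq)
    if n == 0:
        return 0.0
    a = {}

    # Base case: subsequences of length one come from emission rules.
    for i, ch in enumerate(seq):
        for (v, terminal), p in emission_rules.items():
            if terminal == ch:
                a[(i, i, v)] = a.get((i, i, v), 0.0) + p

    # Recursion: build longer spans from every split point k.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for (v, x, y), p in binary_rules.items():
                total = 0.0
                for k in range(i, j):
                    total += a.get((i, k, x), 0.0) * a.get((k + 1, j, y), 0.0)
                if total:
                    a[(i, j, v)] = a.get((i, j, v), 0.0) + p * total

    return a.get((0, n - 1, start), 0.0)


if __name__ == "__main__":
    # Toy grammar (illustrative): S -> S S with 0.4, S -> a with 0.6.
    binary = {("S", "S", "S"): 0.4}
    emit = {("S", "a"): 0.6}
    print(inside("aaa", binary, emit))
```

Replacing the sum over split points and rules with a maximum turns the same table into the Cocke-Younger-Kasami computation mentioned on the previous slide.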

Constructing Fold Envelopes Constrain to possible secondary structures. Constrain to primary sequence alignment
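One way to picture a structure-constrained fold envelope is as the set of subsequence coordinates that a dynamic program is allowed to visit. The sketch below is an interpretation, not the construction from the Holmes and Rubin paper: it assumes a single known structure in dot-bracket notation and keeps only half-open intervals `[i, j)` that never separate the two ends of a base pair. `base_pairs` and `fold_envelope` are illustrative names.

```python
def base_pairs(dot_bracket):
    """Return (left, right) index pairs from a dot-bracket structure string."""
    stack, pairs = [], []
    for pos, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs.append((stack.pop(), pos))
    if stack:
        raise ValueError("unbalanced structure")
    return pairs


def fold_envelope(dot_bracket):
    """Set of half-open intervals (i, j) compatible with the structure.

    Assumed compatibility rule: an interval must contain both ends of a
    base pair or neither end.
    """
    pairs = base_pairs(dot_bracket)
    n = len(dot_bracket)
    envelope = set()
    for i in range(n + 1):
        for j in range(i, n + 1):
            if all((i <= left < j) == (i <= right < j) for left, right in pairs):
                envelope.add((i, j))
    return envelope


if __name__ == "__main__":
    structure = "((..))"
    allowed = fold_envelope(structure)
    unconstrained = (len(structure) + 1) * (len(structure) + 2) // 2
    print(len(allowed), "of", unconstrained, "subsequences kept")
```

Restricting the inside or CYK loops to the intervals in such a set is what reduces the work, at the cost of missing any structure outside the envelope.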

Summary RSEARCH to find a set of possible homologs, sorted by score and statistics Fold Envelopes permit greater search depth in case of unfolded comparisons RSEARCH employs simplified pSCFGs Fold Envelopes are useful over full spectrum of comparisons but represent more computationally complex situations
