Transformational Grammars


1 Transformational Grammars
“Colourless green ideas sleep furiously” - Noam Chomsky
We might ask, “Is this novel sentence (or sequence!) grammatical?”, i.e., does the language described by some grammar validly contain this sentence? Chomsky turned this question on its head and instead asked: “Could the grammar we’re considering have possibly generated this sentence?” He developed finite formal machines (“grammars”) that can, in theory, recursively enumerate the infinitude of possible sentences of the corresponding language.

2 Transformational Grammars: The Chomsky hierarchy of grammars
The classes are nested: Unrestricted contains Context-sensitive, which contains Context-free, which contains Regular.
The more deeply nested the grammar, the simpler the rules. The innermost grammars are the easiest to parse, but are also the most restricted.
Slide after Durbin, et al., 1998

3 Regular Grammars: Symbols and Productions (a.k.a. “rewriting rules”)
All transformational grammars are defined by their set of symbols and the production rules for manipulating strings consisting of those symbols.
There are only two types of symbols:
Terminals (generically represented as “a”): these actually appear in the final observed string (so imagine nucleotide or amino acid symbols).
Non-terminals (generically represented as “W”): abstract symbols; it is easiest to see how they are used through example. The start state (usually shown as “S”) is a commonly used non-terminal.
The non-terminals are often used as placeholders that disappear from the final string.

4 Regular Grammars: Symbols and Productions (a.k.a. “rewriting rules”)
Only two productions are allowed in a regular grammar!
    W → aW
    W → a
We often also use a special terminal symbol “ε”, which is used to denote the null string and to end a production:
    W → ε
Don’t freak out! It’s easier to demonstrate how this all works than it is to describe!
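
The restriction on production shapes can be checked mechanically. The sketch below is only an illustration, not part of the original slides: it assumes every symbol is a single character, writes the null string as an empty Python string, and uses a hypothetical helper name, is_regular.

```python
def is_regular(productions, nonterminals):
    """Return True if every production has one of the allowed shapes.

    productions maps a non-terminal to a list of right-hand sides.
    Assumption: each symbol is a single character and "" is the null string.
    """
    for lhs, alternatives in productions.items():
        if lhs not in nonterminals:
            return False
        for rhs in alternatives:
            if rhs == "":
                continue  # W -> null string
            if len(rhs) == 1 and rhs not in nonterminals:
                continue  # W -> a
            if len(rhs) == 2 and rhs[0] not in nonterminals and rhs[1] in nonterminals:
                continue  # W -> aW
            return False  # any other shape is not a regular production
    return True


nucleotide_grammar = {"S": ["AS", "CS", "GS", "TS", ""]}
nested_grammar = {"S": ["ASA", ""]}  # terminal on both sides of S: not regular

print(is_regular(nucleotide_grammar, {"S"}))
print(is_regular(nested_grammar, {"S"}))
```

The second grammar is rejected because a non-terminal may only appear at the right-hand end of a production.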

5 Regular Grammars: Symbols and Productions (a.k.a. “rewriting rules”)
Here’s a trivial regular grammar that can produce all possible nucleotide sequences:
    W = { S = "Start" }
    a = { A, G, C, T, ε }
    S → AS
    S → CS
    S → GS
    S → TS
    S → ε
Imagine we always start with S. We can then repeatedly choose any of the valid productions, with S being replaced each time by the string on the right-hand side of the production we’ve chosen.
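
The derivation process described on this slide can be sketched in a few lines of Python. This is an illustrative sketch rather than anything from the original slides: the names PRODUCTIONS and derive are hypothetical, the null string is written as "", and each production is chosen uniformly at random.

```python
import random

# The nucleotide grammar: S -> AS | CS | GS | TS | null string ("")
PRODUCTIONS = {"S": ["AS", "CS", "GS", "TS", ""]}


def derive(productions, start="S", rng=random, max_steps=1000):
    """Repeatedly rewrite the leftmost non-terminal until none remain.

    Returns the final string and the list of intermediate strings.
    """
    current = start
    history = [current]
    for _ in range(max_steps):
        positions = [i for i, ch in enumerate(current) if ch in productions]
        if not positions:
            return current, history  # only terminals left: derivation is done
        i = positions[0]
        rhs = rng.choice(productions[current[i]])
        current = current[:i] + rhs + current[i + 1:]
        history.append(current)
    raise RuntimeError("derivation did not terminate within max_steps")


sequence, history = derive(PRODUCTIONS, rng=random.Random(0))
print(" => ".join(step if step else '""' for step in history))
```

Each intermediate string ends in S until the null production is chosen, at which point S disappears and only nucleotides remain.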

6 Regular Grammars: Symbols and Productions (a.k.a. “rewriting rules”)
Here’s a trivial regular grammar that can produce all possible nucleotide sequences, now written in shorthand:
    W = { S = "Start" }
    a = { A, G, C, T, ε }
    S → AS | CS | GS | TS | ε
Imagine we always start with S. We can then repeatedly choose any of the valid productions, with S being replaced each time by the string on the right-hand side of the production we’ve chosen.

7 Protein motifs as regular grammars: “Classic” PROSITE motifs
The RNP-1 motif as a regular grammar:
    S → rW1 | kW1
    W1 → gW2
    W2 → [afilmnqstvwy]W3
    W3 → [agsci]W4
    W4 → fW5 | yW5
    W5 → lW6 | iW6 | vW6 | aW6
    W6 → [acdefghiklmnpqrstvwy]W7
    W7 → f | y | m
Example sequences containing the RNP-1 motif:
    RU1A_HUMAN  SRSLKMRGQAFVIFKEVSSAT
    SKLF_DROME  KLTGRPRGVAFVRYNKREEAQ
    ROC_HUMAN   VGCSVHKGFAFVQYVNERNAR
    ELAV_DROME  GNDTQTKGVGFIRFDKREEAT
The same motif as a PROSITE pattern: [RK]-G-{EDRKHPCG}-[AGSCI]-[FY]-[LIVA]-x-[FYM]
Does this remind you of anything we’ve seen before?
Slide after Durbin, et al., 1998
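
One answer to the closing question is regular expressions. The sketch below translates the RNP-1 PROSITE pattern into a Python regular expression and searches the four sequences from this slide. The helper prosite_to_regex is hypothetical and handles only the syntax used in this one pattern (residue sets, {excluded} sets, and x); repeat counts and anchors are not covered.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regex string."""
    parts = []
    for element in pattern.split("-"):
        if element == "x":
            parts.append(".")  # x matches any residue
        elif element.startswith("{") and element.endswith("}"):
            parts.append("[^" + element[1:-1] + "]")  # {..} excludes residues
        else:
            parts.append(element)  # a single residue or a [..] set
    return "".join(parts)


RNP1 = "[RK]-G-{EDRKHPCG}-[AGSCI]-[FY]-[LIVA]-x-[FYM]"
rnp1_regex = re.compile(prosite_to_regex(RNP1))

sequences = {
    "RU1A_HUMAN": "SRSLKMRGQAFVIFKEVSSAT",
    "SKLF_DROME": "KLTGRPRGVAFVRYNKREEAQ",
    "ROC_HUMAN": "VGCSVHKGFAFVQYVNERNAR",
    "ELAV_DROME": "GNDTQTKGVGFIRFDKREEAT",
}

for name, seq in sequences.items():
    match = rnp1_regex.search(seq)
    print(name, match.group() if match else None)
```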

8 Automata
Formal grammars are generative. However, each class of Chomsky grammar can be parsed using a corresponding abstract computational machine, or automaton.
    Grammar                     Parsing automaton
    Regular grammar             Finite state automaton
    Context-free grammar        Push-down automaton
    Context-sensitive grammar   Linear bounded automaton
    Unrestricted grammar        Turing machine
The automata for the two most general grammars (context-sensitive and unrestricted) are of great theoretical interest but are of less practical significance for us because of the time and space complexity of the algorithms: their decision problems may only be computationally feasible in special cases. We will focus on the first two (regular and context-free) only!

9 Trinucleotide Repeat Disorders
A family of diseases resulting from a trinucleotide expansion. Can we identify sequences with well-defined repeat characteristics?
Fragile X: associated with 200 to 4000 repeats of a CGG trinucleotide in the FMR-1 gene. Unaffected individuals typically have 5-40 copies, but individuals with intermediate numbers are considered to have a “premutation” with variable penetrance.
CAG repeats: at least 9 different “PolyQ” disorders have been identified so far. Most are autosomal dominant.
Huntington disease: affected individuals have >35 copies of the CAG repeat in the HD (Huntington disease) gene.

10 A Finite State Automaton: The FMR triplet repeat considered as a sequence of states
The grammar generates, the automaton parses.
[Figure: state diagram with start state S and states 1 to 8; the transitions are labeled with the nucleotides g, c, a, and t.]
The FMR triplet regular grammar:
    S → gW1
    W1 → cW2
    W2 → gW3
    W3 → cW4
    W4 → gW5
    W5 → gW6
    W6 → cW7 | aW4 | cW4
    W7 → tW8
    W8 → g

11 A Finite State Automaton: The FMR triplet repeat considered as a sequence of states
FSAs can be either deterministic or non-deterministic. Because our FMR repeat FSA offers more than one path for accepting a symbol in state 6 (a c can lead on to state 7 or back to state 4), this is a non-deterministic FSA. An automaton with only one possible sequence of states (the “state path”) for any given input is always deterministic.
[Figure: the FMR state diagram, repeated from the previous slide.]
Note, however, that there are no probabilities associated with the state transitions. This FSA is therefore NOT a probabilistic or stochastic model.

12 Finite State Automata: Moore vs. Mealy machines
The FSA shown above is a so-called “Mealy machine”: Mealy machines “accept” or “emit” upon transition to a new state. Later we will see and use examples of “Moore machines”: Moore machines instead “accept on state”.
[Figure: the FMR state diagram, repeated.]
Moore and Mealy machines are always interconvertible. Think about ways to redraw this FSA as a Moore machine.

13 Finite State Automata: The FMR regular grammar as a Python data structure
This is just one possible embodiment! This dict has keys that are states, and values that are lists of “acceptance conditions”. Each acceptance condition is a tuple containing the symbol that would lead to acceptance and the state that should be “transitioned to”.
    states = {
        "Start": [("G", "W1")],
        "W1": [("C", "W2")],
        "W2": [("G", "W3")],
        "W3": [("C", "W4")],
        "W4": [("G", "W5")],
        "W5": [("G", "W6")],
        "W6": [("C", "W7"), ("A", "W4"), ("C", "W4")],
        "W7": [("T", "W8")],
        "W8": [("G", "End")],
    }
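
With the grammar stored this way, the source of non-determinism can be found by inspection of the data. The sketch below lists every state that offers more than one transition for the same symbol. The function name ambiguous_transitions is hypothetical and not part of the original slides.

```python
from collections import Counter

states = {
    "Start": [("G", "W1")],
    "W1": [("C", "W2")],
    "W2": [("G", "W3")],
    "W3": [("C", "W4")],
    "W4": [("G", "W5")],
    "W5": [("G", "W6")],
    "W6": [("C", "W7"), ("A", "W4"), ("C", "W4")],
    "W7": [("T", "W8")],
    "W8": [("G", "End")],
}


def ambiguous_transitions(states):
    """Map each state to the symbols it accepts in more than one way."""
    found = {}
    for state, productions in states.items():
        counts = Counter(symbol for symbol, _ in productions)
        repeated = sorted(s for s, n in counts.items() if n > 1)
        if repeated:
            found[state] = repeated
    return found


print(ambiguous_transitions(states))
```

An empty result would mean the table describes a deterministic FSA; here only W6 is reported, for the symbol C.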

14 Reducing an FSA to Python code: The deterministic case
This is fairly straightforward:
    Initialize cur_state to "Start".
    Initialize cur_position in the test sequence to zero.
    Initialize result_str to "".
    Iterate over positions in the sequence:
        Is the symbol at cur_position a valid production for cur_state?
            No? Failure. Return False.
            Yes! Accept the symbol: concatenate it to result_str and set cur_state to new_state.
        Is cur_state now "End"?
            Yes! Success! Return result_str.
    Exhausted the test sequence? Failure. Return False.
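
A minimal sketch of these steps follows. It is one possible reading of the outline, not the author's code. Because the FMR table is non-deterministic, the example uses a hypothetical deterministic variant, det_states, in which the ("C", "W4") branch is dropped (an assumption made purely for illustration). The function name accept_deterministic is also hypothetical.

```python
# Assumption: a deterministic variant of the FMR table, obtained by
# removing the ("C", "W4") alternative from state W6.
det_states = {
    "Start": [("G", "W1")],
    "W1": [("C", "W2")],
    "W2": [("G", "W3")],
    "W3": [("C", "W4")],
    "W4": [("G", "W5")],
    "W5": [("G", "W6")],
    "W6": [("C", "W7"), ("A", "W4")],
    "W7": [("T", "W8")],
    "W8": [("G", "End")],
}


def accept_deterministic(states, seq, start="Start", end="End"):
    """Return the accepted string, or False if the sequence is rejected."""
    cur_state = start
    result_str = ""
    for symbol in seq:
        # In a deterministic table each symbol maps to at most one state.
        new_state = dict(states[cur_state]).get(symbol)
        if new_state is None:
            return False  # no valid production for this symbol
        result_str += symbol
        cur_state = new_state
        if cur_state == end:
            return result_str  # success as soon as "End" is reached
    return False  # exhausted the test sequence without reaching "End"


print(accept_deterministic(det_states, "GCGCGGAGGCTG"))
print(accept_deterministic(det_states, "GCGCGGCGGCTG"))
```

The second call fails because, without the dropped branch, a CGG repeat after the first two triplets can no longer be followed.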

15 Reducing an FSA to Python code: The non-deterministic case
This is less straightforward! We can no longer just iterate over the test sequence. For each symbol in the test sequence, we might have to consider multiple valid productions (think loop, yes?). We therefore may need to explore “branches” corresponding to these alternatives before we find one that is “correct”.
Although not necessarily the most efficient way, recursion is an easy way to explore these branches:
    If a possible production is valid, assume that it is correct by accepting the symbol and the new state.
    Increment the position in the test sequence and make the recursive call.
    “Success” or “Failure” can easily be propagated back up through the recursion by testing the result of the recursive call and, on success, returning the sequence it returned.
    If execution gets past the recursive call test, the branch has failed: decrement the position in the test sequence and go to the next possible production.
    If there are no more productions to consider, we’ve failed: return False.
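
The recursive strategy can be sketched as below. This is an illustration under stated assumptions, not the author's implementation: the function name accept is hypothetical, and the position is passed as an argument to each recursive call, so "decrementing" it on failure happens implicitly when the call returns.

```python
fmr_states = {
    "Start": [("G", "W1")],
    "W1": [("C", "W2")],
    "W2": [("G", "W3")],
    "W3": [("C", "W4")],
    "W4": [("G", "W5")],
    "W5": [("G", "W6")],
    "W6": [("C", "W7"), ("A", "W4"), ("C", "W4")],
    "W7": [("T", "W8")],
    "W8": [("G", "End")],
}


def accept(states, seq, cur_state="Start", pos=0):
    """Return the accepted prefix of seq, or False if no branch succeeds."""
    if cur_state == "End":
        return seq[:pos]  # success: everything consumed so far was accepted
    if pos >= len(seq):
        return False  # ran out of sequence before reaching "End"
    for symbol, new_state in states[cur_state]:
        if seq[pos] == symbol:
            # Assume this production is correct and explore the branch.
            result = accept(states, seq, new_state, pos + 1)
            if result is not False:
                return result  # propagate success back up
            # Otherwise the branch failed: try the next production.
    return False  # no productions left to consider


print(accept(fmr_states, "GCGCGGCGGCTG"))
print(accept(fmr_states, "GCGCGGTTT"))
```

In the first call the ("C", "W7") branch out of W6 is tried first and fails on the following G, after which the ("C", "W4") branch succeeds.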

16 Python focus: classes
Defining a class
As with functions, minimally, all we need is a statement block of Python code that we have given a name!
    class I_dont_do_much(object):
        # any code you like!!
        pass
Capital letters are OK in class names. The class won’t do anything interesting, though, until we have specified some data and some methods! Python classes are essentially user-defined data types.

17 Python focus: classes
The __init__ method
    class FSA(object):
        def __init__(self, states):
            self.states = states
This method corresponds to the “constructor” in other OOP languages. Methods are defined as functions, and the first argument is always self. Variables declared outside of a method have the same value in all instances of that class! Variables prepended with self become instance variables, and are visible throughout the namespace of a class instance.
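
The difference between the two kinds of variable is easy to see in a small experiment. In this sketch the class variable start_state is a hypothetical addition for illustration; it does not appear in the original FSA class.

```python
class FSA(object):
    # Declared outside any method: a class variable shared by all instances.
    # (start_state is an illustrative assumption, not from the slides.)
    start_state = "Start"

    def __init__(self, states):
        # Prepended with self: an instance variable, separate per instance.
        self.states = states


first = FSA({"Start": [("G", "End")]})
second = FSA({"Start": [("A", "End")]})

# Changing the class variable is visible through every instance...
FSA.start_state = "S"
print(first.start_state, second.start_state)

# ...while each instance keeps its own states.
print(first.states is second.states)
```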

18 Python focus: classes
User-defined methods
    class FSA(object):
        def __init__(self, states):
            self.states = states
            # initialize some other stuff

        def test(self, seq, cur_state="Start"):
            some_var = 0
            # do some things
            return something
User methods are defined as functions, and the first argument is always self. These methods are the interface with your class. The variable some_var is visible only within the user-defined test method!

19 Python focus: classes
Using classes
    my_FSA = FSA(my_state_dict)
    result = my_FSA.test("AGCTGGGGTTTAATT")
Instantiate a class by invoking its name and providing the arguments the __init__ method expects. We can make as many instances of a class as we need! Invoke class methods by using the instance identifier in conjunction with the method name, using attribute notation!
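
To make the usage above concrete, here is one way the FSA class could be filled in. This is a sketch under stated assumptions, not the author's class: the test method tracks the set of states the automaton could currently be in (an alternative to exploring branches one at a time) and returns True or False rather than the accepted string.

```python
class FSA(object):
    def __init__(self, states):
        self.states = states

    def test(self, seq, cur_state="Start"):
        """Return True if some state path reaches "End" while reading seq."""
        active = {cur_state}  # every state we could currently be in
        for symbol in seq:
            active = {
                new_state
                for state in active
                for accepted, new_state in self.states.get(state, [])
                if accepted == symbol
            }
            if "End" in active:
                return True
            if not active:
                return False  # no state accepts this symbol
        return False  # sequence exhausted before reaching "End"


my_state_dict = {
    "Start": [("G", "W1")],
    "W1": [("C", "W2")],
    "W2": [("G", "W3")],
    "W3": [("C", "W4")],
    "W4": [("G", "W5")],
    "W5": [("G", "W6")],
    "W6": [("C", "W7"), ("A", "W4"), ("C", "W4")],
    "W7": [("T", "W8")],
    "W8": [("G", "End")],
}

my_FSA = FSA(my_state_dict)
print(my_FSA.test("AGCTGGGGTTTAATT"))
print(my_FSA.test("GCGCGGCGGCTG"))
```

The first sequence is rejected at its first symbol, since the start state only accepts G.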

