

Transformational Grammars
“Colourless green ideas sleep furiously” - Noam Chomsky
We might ask: “Is this novel sentence (or sequence!) grammatical?” That is, does the language described by some grammar contain this sentence? Chomsky turned this question on its head and instead asked: “Could the grammar we’re considering have generated this sentence?” He developed finite formal machines (“grammars”) that can, in principle, recursively enumerate the infinitely many possible sentences of the corresponding language.

Transformational Grammars
The Chomsky hierarchy of grammars
The grammar classes are nested, from the most general to the most restricted:
    Unrestricted ⊃ Context-sensitive ⊃ Context-free ⊃ Regular
The more deeply nested the grammar, the simpler its rules. The innermost grammars are the easiest to parse, but are also the most restricted.
Slide after Durbin, et al., 1998

Regular Grammars
Symbols and Productions (a.k.a. “rewriting rules”)
All transformational grammars are defined by their set of symbols and the production rules for manipulating strings consisting of those symbols.
There are only two types of symbols:
    Terminals (generically represented as “a”): these actually appear in the final observed string (so imagine nucleotide or amino acid symbols).
    Non-terminals (generically represented as “W”): abstract symbols; how they are used is easiest to see through example. The start symbol (usually shown as “S”) is a commonly used non-terminal.
The non-terminals are often used as placeholders that disappear from the final string.

Regular Grammars
Symbols and Productions (a.k.a. “rewriting rules”)
Only two forms of production are allowed in a regular grammar!
    W → aW
    W → a
We often also use a special terminal symbol “ε”, which denotes the null string and is used to end a production:
    W → ε
Don’t freak out! It’s easier to demonstrate how this all works than it is to describe!

Regular Grammars
Symbols and Productions (a.k.a. “rewriting rules”)
Here’s a trivial regular grammar that can produce all possible nucleotide sequences:
    W = { S = "Start" }
    a = { A, G, C, T, ε }
    S → AS
    S → CS
    S → GS
    S → TS
    S → ε
Imagine we always start with S. We can then repeatedly choose any of the valid productions, with S being replaced each time by the string on the right-hand side of the production we’ve chosen…

Regular Grammars
Symbols and Productions (a.k.a. “rewriting rules”)
The same trivial nucleotide grammar, written in compact notation with “|” separating the alternative productions:
    W = { S = "Start" }
    a = { A, G, C, T, ε }
    S → AS | CS | GS | TS | ε
As before, we always start with S and repeatedly choose any of the valid productions, with S being replaced each time by the string on the right-hand side of the production we’ve chosen…
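To make the rewriting process concrete, here is a minimal Python sketch of the nucleotide grammar used as a generator. The names PRODUCTIONS and generate, and the max_steps cut-off, are assumptions introduced for illustration only; they are not part of the original slides.

```python
import random

# The grammar S -> AS | CS | GS | TS | ε, with ε written as the empty string.
PRODUCTIONS = {"S": ["AS", "CS", "GS", "TS", ""]}

def generate(rng, max_steps=50):
    """Derive one string by repeatedly rewriting the non-terminal S."""
    string = "S"
    for _ in range(max_steps):
        if "S" not in string:
            break  # no non-terminals left: the derivation is finished
        rhs = rng.choice(PRODUCTIONS["S"])    # pick any valid production
        string = string.replace("S", rhs, 1)  # rewrite S with its right-hand side
    # If the step budget ran out, end the derivation with S -> ε.
    return string.replace("S", "")

rng = random.Random(0)
for _ in range(3):
    print(repr(generate(rng)))
```

Every string this produces consists only of A, C, G and T, because the non-terminal S never survives to the end of a derivation.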

Protein motifs as regular grammars
“Classic” PROSITE motifs

The RNP-1 motif as a PROSITE pattern:
    [RK]-G-{EDRKHPCG}-[AGSCI]-[FY]-[LIVA]-x-[FYM]

The same motif as a regular grammar (a bracketed set such as [agsci] is shorthand for one production per listed residue):
    S  → rW1 | kW1
    W1 → gW2
    W2 → [afilmnqstvwy]W3
    W3 → [agsci]W4
    W4 → fW5 | yW5
    W5 → lW6 | iW6 | vW6 | aW6
    W6 → [acdefghiklmnpqrstvwy]W7
    W7 → f | y | m

Example sequences containing the motif:
    RU1A_HUMAN   SRSLKMRGQAFVIFKEVSSAT
    SKLF_DROME   KLTGRPRGVAFVRYNKREEAQ
    ROC_HUMAN    VGCSVHKGFAFVQYVNERNAR
    ELAV_DROME   GNDTQTKGVGFIRFDKREEAT

Slide after Durbin, et al., 1998
Does this remind you of anything we’ve seen before?
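One thing the PROSITE notation resembles is a regular expression. The sketch below translates the RNP-1 pattern into a Python regular expression and searches the four example sequences with it. The helper prosite_to_regex is a hypothetical name, and it handles only the three constructs that appear in this pattern (bracketed sets, {…} exclusions and x); it is not a full PROSITE parser.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a regular expression string."""
    parts = []
    for element in pattern.split("-"):
        if element == "x":
            parts.append(".")                        # x: any residue
        elif element.startswith("{") and element.endswith("}"):
            parts.append("[^" + element[1:-1] + "]")  # {..}: any residue except these
        else:
            parts.append(element)                    # literal residue or [..] set
    return "".join(parts)

RNP1 = "[RK]-G-{EDRKHPCG}-[AGSCI]-[FY]-[LIVA]-x-[FYM]"
rnp1_regex = re.compile(prosite_to_regex(RNP1))

sequences = {
    "RU1A_HUMAN": "SRSLKMRGQAFVIFKEVSSAT",
    "SKLF_DROME": "KLTGRPRGVAFVRYNKREEAQ",
    "ROC_HUMAN": "VGCSVHKGFAFVQYVNERNAR",
    "ELAV_DROME": "GNDTQTKGVGFIRFDKREEAT",
}

for name, seq in sequences.items():
    match = rnp1_regex.search(seq)
    print(name, match.group(0) if match else None)
```

All four sequences contain a match, which is what we would expect given that they were chosen to illustrate the motif.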

Automata
Formal grammars are generative. However, each class of grammar in the Chomsky hierarchy can be parsed using a corresponding abstract computational machine, or automaton:

    Grammar                      Parsing automaton
    Regular grammar              Finite state automaton
    Context-free grammar         Push-down automaton
    Context-sensitive grammar    Linear bounded automaton
    Unrestricted grammar         Turing machine

The automata for the two most general grammars (context-sensitive and unrestricted) are of great theoretical interest, but are of less practical significance for us because of the time and space complexity of the algorithms: their decision problems may only be computationally feasible in special cases. We will focus on the first two only!

Trinucleotide Repeat Disorders
A family of diseases resulting from a trinucleotide expansion. Can we identify sequences with well-defined repeat characteristics?
    Fragile X: associated with 200 to 4000 repeats of a CGG trinucleotide in the FMR-1 gene. Unaffected individuals typically have 5-40 copies, but individuals with intermediate numbers are considered to have a “premutation” with variable penetrance.
    CAG repeats: at least 9 different “PolyQ” disorders have been identified so far. Most are autosomal dominant.
    Huntington disease: affected individuals have >35 copies of the CAG repeat in the HD (Huntington disease) gene.

A Finite State Automaton
The FMR triplet repeat considered as a sequence of states
The grammar generates, the automaton parses.
[Figure: state diagram of the FMR triplet repeat automaton, running from S to ε]
The FMR triplet regular grammar:
    S  → gW1
    W1 → cW2
    W2 → gW3
    W3 → cW4
    W4 → gW5
    W5 → gW6
    W6 → cW7 | aW4 | cW4
    W7 → tW8
    W8 → g

A Finite State Automaton
The FMR triplet repeat considered as a sequence of states
FSAs can be either deterministic or non-deterministic. Because our FMR repeat FSA offers more than one way to accept the same symbol in state 6 (a “c” can lead either on to state 7 or back to state 4), it is a non-deterministic FSA. An automaton with only one possible sequence of states (the “state path”) for any given input is always deterministic.
Note, however, that there are no probabilities associated with the state transitions. This FSA is therefore NOT a probabilistic or stochastic model.

Finite State Automata
Moore vs. Mealy machines
The FSA shown above is a so-called “Mealy machine”. Mealy machines “accept” or “emit” upon transition to a new state.
Later we will see and use examples of “Moore machines”. Moore machines instead “accept on state”.
Moore and Mealy machines are always interconvertible. Think about ways to redraw this FSA as a Moore machine.

Finite State Automata
The FMR regular grammar as a Python data structure
This is just one possible embodiment! This dict has keys that are states, and values that are lists of “acceptance conditions”. Each acceptance condition is a tuple holding the symbol that would lead to acceptance and the state that should be “transitioned to”.

    states = {
        "Start": [("G", "W1")],
        "W1": [("C", "W2")],
        "W2": [("G", "W3")],
        "W3": [("C", "W4")],
        "W4": [("G", "W5")],
        "W5": [("G", "W6")],
        "W6": [("C", "W7"), ("A", "W4"), ("C", "W4")],
        "W7": [("T", "W8")],
        "W8": [("G", "End")],
    }

Reducing an FSA to Python code
The deterministic case
This is fairly straightforward:
    Initialize cur_state to “Start”
    Initialize cur_position in the test sequence to zero
    Initialize result_str to “”
    Iterate over positions in the sequence:
        Is the symbol at cur_position a valid production?
            No? Failure. Return False
            Yes! Accept the symbol:
                concatenate the symbol at cur_position to result_str
                set cur_state to new_state
        Is cur_state now “End”?
            Yes! Success! Return result_str
    Exhausted the test sequence? Failure. Return False
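The steps above might be sketched in Python as follows. This is an illustration under stated assumptions rather than the course's reference solution: the function name run_fsa is hypothetical, and DET_STATES is a deterministic variant of the FMR state dict made by dropping the second "C" transition out of W6, so that each state has at most one transition per symbol.

```python
# Assumed deterministic variant of the FMR state dict (no duplicate "C" out of W6).
DET_STATES = {
    "Start": [("G", "W1")],
    "W1": [("C", "W2")],
    "W2": [("G", "W3")],
    "W3": [("C", "W4")],
    "W4": [("G", "W5")],
    "W5": [("G", "W6")],
    "W6": [("C", "W7"), ("A", "W4")],
    "W7": [("T", "W8")],
    "W8": [("G", "End")],
}

def run_fsa(states, seq, start="Start", end="End"):
    """Return the accepted string, or False if seq is rejected."""
    cur_state = start
    result_str = ""
    for symbol in seq:
        # Look for the (single) transition that accepts this symbol.
        new_state = None
        for accepted, target in states.get(cur_state, []):
            if symbol == accepted:
                new_state = target
                break
        if new_state is None:
            return False           # no valid production: failure
        result_str += symbol       # accept the symbol
        cur_state = new_state
        if cur_state == end:
            return result_str      # reached "End": success
    return False                   # sequence exhausted before "End"

print(run_fsa(DET_STATES, "GCGCGGCTG"))
print(run_fsa(DET_STATES, "GCGCGG"))
```

Note that, like the pseudocode, this returns as soon as "End" is reached, so any symbols after the accepted portion of the sequence are ignored.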

Reducing an FSA to Python code
The non-deterministic case is less straightforward!
We can no longer just iterate over the test sequence! For each symbol in the test sequence, we might have to consider multiple valid productions (think loop, yes?). We therefore may need to explore “branches” corresponding to these alternatives before we find one that is “correct”.
Although not necessarily the most efficient way, recursion is an easy way to explore these branches:
    If a possible production is valid, assume that it is correct by accepting the symbol and the new state.
    Increment the position in the test sequence and make the recursive call.
    “Success” or “Failure” can easily be propagated back up through the recursion by testing the result of the recursive call and, on success, returning the resulting sequence.
    If execution gets past the recursive call test, the branch has failed: decrement the position in the test sequence and go to the next possible production.
    If there are no more productions to consider, we’ve failed: return False.
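A minimal recursive sketch of this branch-exploring search is shown below, using the non-deterministic FMR state dict. The function name accept and its parameters are assumptions made for illustration. Because the position is passed as an argument rather than stored in a shared variable, "decrementing" it on failure happens implicitly when a recursive call returns.

```python
states = {
    "Start": [("G", "W1")],
    "W1": [("C", "W2")],
    "W2": [("G", "W3")],
    "W3": [("C", "W4")],
    "W4": [("G", "W5")],
    "W5": [("G", "W6")],
    "W6": [("C", "W7"), ("A", "W4"), ("C", "W4")],
    "W7": [("T", "W8")],
    "W8": [("G", "End")],
}

def accept(states, seq, cur_state="Start", pos=0, end="End"):
    """Return the accepted prefix of seq, or False if every branch fails."""
    if cur_state == end:
        return seq[:pos]                 # success: propagate the accepted string
    if pos >= len(seq):
        return False                     # ran out of sequence on this branch
    for symbol, new_state in states.get(cur_state, []):
        if seq[pos] == symbol:
            # Assume this production is correct and explore the branch.
            result = accept(states, seq, new_state, pos + 1, end)
            if result is not False:
                return result
            # Otherwise the branch failed: fall through to the next production.
    return False                         # no productions left to consider

# The first "C" out of W6 leads to a dead end here; the second one succeeds.
print(accept(states, "GCGCGGCGGCTG"))
```

In the example call, the search first tries W6 → W7 on the "C" at position 6, fails because the next symbol is not "T", and then backtracks to try W6 → W4 instead.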

Python focus - classes
Defining a class
Like functions, minimally, all we need is a statement block of Python code that we have given a name!

    class I_dont_do_much(object):   # capital letters are OK in class names
        # any code you like!!
        pass

…but it won’t do anything interesting until we have specified some data and some methods!
Python classes are essentially user-defined data types.

Python focus - classes
The __init__ method
This method corresponds to the “constructor” in other OOP languages.

    class FSA(object):
        def __init__(self, states):
            self.states = states

Methods are defined as functions, and their first argument is always self.
Variables declared outside of a method (class variables) have the same value in all instances of that class!
Variables prefixed with “self.” become instance variables, and are visible throughout the namespace of a class instance.
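The difference between the two kinds of variable can be seen in a short sketch. The class variable start_state is a hypothetical addition made only to demonstrate the point; the slides' FSA class does not define it.

```python
class FSA(object):
    start_state = "Start"        # class variable: shared by every instance

    def __init__(self, states):
        self.states = states     # instance variable: one copy per instance

a = FSA({"Start": [("G", "End")]})
b = FSA({"Start": [("A", "End")]})

print(a.states == b.states)              # False: each instance has its own states
print(a.start_state == b.start_state)    # True: both see the class variable

FSA.start_state = "S"                    # change it once, on the class...
print(a.start_state, b.start_state)      # S S  ...and every instance sees the change
```

One caveat worth knowing: assigning a.start_state = "X" on an instance creates a new instance variable that shadows the class variable for that instance only.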

Python focus - classes
User-defined methods
User methods are defined as functions, with self always as the first argument. These are the interface with your class.

    class FSA(object):
        def __init__(self, states):
            self.states = states
            # initialize some other stuff

        def test(self, seq, cur_state="Start"):
            some_var = 0
            # do some things
            return some_var

The variable some_var is visible only within the user-defined test method!

Python focus - classes
Using classes

    my_FSA = FSA(my_state_dict)
    result = my_FSA.test("AGCTGGGGTTTAATT")

Instantiate a class by invoking its name and providing the arguments the __init__ method expects. We can make as many instances of a class as we need!
Invoke a method just by using the instance identifier in conjunction with the method name, using attribute (dot) notation!
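Putting the pieces together, one possible complete FSA class is sketched below, with two independent instances built from different state dicts. The body of test, the pos parameter, and the toy dict toy_states are assumptions made for illustration; the slides leave the method body open.

```python
class FSA(object):
    def __init__(self, states):
        self.states = states

    def test(self, seq, cur_state="Start", pos=0):
        """Return the accepted prefix of seq, or False if seq is rejected."""
        if cur_state == "End":
            return seq[:pos]
        if pos >= len(seq):
            return False
        for symbol, new_state in self.states.get(cur_state, []):
            if seq[pos] == symbol:
                result = self.test(seq, new_state, pos + 1)
                if result is not False:
                    return result
        return False

fmr_states = {
    "Start": [("G", "W1")], "W1": [("C", "W2")], "W2": [("G", "W3")],
    "W3": [("C", "W4")], "W4": [("G", "W5")], "W5": [("G", "W6")],
    "W6": [("C", "W7"), ("A", "W4"), ("C", "W4")],
    "W7": [("T", "W8")], "W8": [("G", "End")],
}
toy_states = {"Start": [("A", "End"), ("T", "End")]}  # hypothetical second machine

fmr_FSA = FSA(fmr_states)   # two instances of the same class,
toy_FSA = FSA(toy_states)   # each holding its own state dict

print(fmr_FSA.test("AGCTGGGGTTTAATT"))  # False: the FMR machine needs a leading G
print(fmr_FSA.test("GCGCGGCGGCTG"))     # GCGCGGCGGCTG
print(toy_FSA.test("AGCTGGGGTTTAATT"))  # A
```

Keeping the state dict as an instance variable is what lets the same class serve any number of different automata.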