Chapter 7 – Sequence patterns (first part)


Slide 1: Chapter 7 – Sequence patterns (first part)

We want a signature for a protein sequence family. The signature should ideally satisfy:
- All sequences in the family should satisfy the signature
- No other sequences should satisfy the signature

We can divide the signatures in use into:
- Probabilistic: a score is calculated between a sequence and the signature (how well the sequence matches the model of the family)
  - Profile, HMM profile, ...
- Deterministic: a sequence either satisfies (matches) the signature, or not
  - Regular expression, sequence pattern (motif)

Slide 2: Regular expressions (from Gusfield 3.6)

- A method for describing a pattern
- A pattern can be used to describe what is common to a set of sequences/strings
- Example: PROSITE pattern
  - [AS]-x(2,4)-A-x(1)-[CA]
  - Symbols in [] are alternatives
  - x(i,j) means between i and j arbitrary symbols (wildcards)
  - Several other PROSITE rules

Slide 3: Regular expressions (cont.)

- Formal definition of a regular expression (RE)
  - Σ is an alphabet (e.g. the 20 amino acids)
  - The symbols *, +, (, ), ε are not in Σ
- A string T matches a regular expression R if R specifies T

Slide 4: Regular expressions (cont.)

We can represent a regular expression R as a graph G(R) (a non-deterministic finite state machine).
- Make a start node s
- Make an end node t
- Each edge is labeled by a symbol from Σ ∪ {ε}
- A path from s to t represents a string specified by R
- Every string specified by R corresponds to a path from s to t

Slide 5: Search with regular expressions

- Search for match(T,R): are there substrings of T matching R?
- See first if match(prefix(T),R): are there prefixes of T matching R?
  - Make sets N(0), N(1), ...
- If T is of length m, and the regular expression R contains n symbols, then it is possible in time O(nm) to decide whether T contains a substring that matches R.
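
The search over node sets can be sketched in Python. The slide does not define N(i), so this sketch assumes the usual reading: N(0) is the set of nodes reachable from s using only ε edges, and N(i) is the set of nodes reachable from N(i-1) by an edge labeled with the i-th symbol of T, followed by any ε edges. The graph encoding, the function names, and the hand-built example graph (for the expression [AS]C*T) are illustrative assumptions, not part of the slides.

```python
EPS = None  # label used for epsilon edges (an encoding chosen for this sketch)

# Hand-built G(R) for R = (A+S)C*T: node 0 is s, node 3 is t.
GRAPH = {
    0: [("A", 1), ("S", 1)],
    1: [(EPS, 2)],
    2: [("C", 2), ("T", 3)],
}
START, END = 0, 3


def eps_closure(graph, nodes):
    """All nodes reachable from `nodes` using epsilon edges only."""
    seen = set(nodes)
    stack = list(nodes)
    while stack:
        u = stack.pop()
        for label, v in graph.get(u, ()):
            if label is EPS and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def step(graph, nodes, symbol):
    """Nodes reachable from `nodes` by one edge labeled `symbol`, then epsilon edges."""
    moved = {v for u in nodes for label, v in graph.get(u, ()) if label == symbol}
    return eps_closure(graph, moved)


def prefix_matches(graph, s, t, text):
    """True if some prefix of `text` is specified by the graph."""
    current = eps_closure(graph, {s})        # N(0)
    if t in current:
        return True
    for ch in text:
        current = step(graph, current, ch)   # N(i) from N(i-1)
        if t in current:
            return True
        if not current:                      # no path can be extended any further
            return False
    return False


def substring_matches(graph, s, t, text):
    """True if some substring of `text` is specified by the graph."""
    start = eps_closure(graph, {s})
    current = set(start)
    if t in current:
        return True
    for ch in text:
        # Re-adding the start set lets a match begin at any position of the text.
        current = step(graph, current, ch) | start
        if t in current:
            return True
    return False


print(prefix_matches(GRAPH, START, END, "ACCTG"))     # a prefix ACCT is specified
print(substring_matches(GRAPH, START, END, "GGSTG"))  # the substring ST is specified
```

Each text symbol triggers one pass over the edges of the graph, which is where a bound of the form O(nm) comes from when the graph size is proportional to n.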

Slide 6: PROSITE language

- PROSITE is a database of protein families and domains.
- The standard one-letter codes for the amino acids are used
- The symbol `x' is used for an arbitrary amino acid
- Ambiguities are listed between square brackets `[ ]'. For example, [AGL] stands for A or G or L
- Amino acids that are not accepted at a given position are listed between curly brackets `{ }'. For example, {CH} stands for any amino acid except C and H

Slide 7: PROSITE language (cont.)

- `-' is used for separating the elements
- Repetition of an element is specified with a numerical value or a numerical range between parentheses, such that x(3) corresponds to x-x-x and x(1,3) corresponds to x or x-x or x-x-x
- When a pattern is restricted to either the N- or C-terminal of a sequence, that pattern starts with a `<' symbol or ends with a `>' symbol, respectively
- A period ends the pattern

Slide 8: PROSITE language

- [RK]-x(2,3)-[DE]-x(2,3)-Y is matched by KLRACEDEEYRE
- D-x-[DNS]-{ILVFYW}-[DENSTG]-[DNQGHRK]-{GP}-[LIVMC]-[DENQSTAGC]-x(2)-[DE]-[LIVMFYW] is matched by MADANADDDCTAADWST
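
The PROSITE rules listed on the slides map closely onto ordinary regular expressions, so the two examples can be checked mechanically. The following is a minimal sketch, not an official PROSITE tool: the function name `prosite_to_regex` is hypothetical, and only the rules mentioned on the slides are handled (one-letter codes, x, [ ], { }, repetition counts, the `<' and `>' anchors, and the final period).

```python
import re

# One pattern element: x, a single letter, [..] or {..}, with an optional (n) or (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern into a Python regular expression string."""
    pattern = "".join(pattern.split()).rstrip(".")  # drop whitespace and the final period
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # pattern restricted to the N-terminal
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # pattern restricted to the C-terminal
            suffix, element = "$", element[:-1]
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":                  # arbitrary amino acid
            core = "."
        elif core.startswith("{"):       # amino acids that are not accepted
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low is not None:              # x(3) -> .{3}, x(1,3) -> .{1,3}
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


examples = [
    ("[RK]-x(2,3)-[DE]-x(2,3)-Y", "KLRACEDEEYRE"),
    ("D-x-[DNS]-{ILVFYW}-[DENSTG]-[DNQGHRK]-{GP}-[LIVMC]-[DENQSTAGC]-x(2)-[DE]-[LIVMFYW]",
     "MADANADDDCTAADWST"),
]
for pattern, sequence in examples:
    hit = re.search(prosite_to_regex(pattern), sequence)
    print(pattern, "->", hit.group() if hit else "no match")
```

With this translation the first pattern is found at RACEDEEY and the second at DANADDDCTAADW, so both sequences contain a substring that matches.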

Slide 9: Exact/approximate matching

- Shall we make a unique pattern, and allow variations in the search?
- Shall we allow variations in the pattern?
- Consider deterministic patterns
  - Constituted of components and wildcard regions
  - Restrictions on the number and the types of these define classes of patterns
  - A component is of fixed length, but can be
    - Unique
    - Ambiguous
  - A wildcard region can be
    - Fixed, of fixed length
    - Flexible, of varying length
- Given a set of sequences, try to discover a pattern of a given class

Slide 10: Scoring of patterns

- Score the components by scoring each position, and then add
- Score the wildcard regions
- Sum over all
- Use information content
  - The information content of a position with value K_i is the reduction in uncertainty from knowing K_i relative to knowing nothing.
  - Scoring of wildcard regions should decrease with increasing flexibility
- Scoring of x(j_k, i_k) could be
  - c(j_k - i_k)
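
One possible way to turn this scoring idea into numbers is sketched below. It rests on assumptions the slide does not state: all 20 amino acids are taken as equally likely a priori, so a position that allows the set K_i carries log2(20 / |K_i|) bits, and each wildcard region x(low, high) is charged a penalty of c times its flexibility (high - low), with c = 0.5 chosen arbitrarily. The function names and the input layout are hypothetical.

```python
import math


def position_information(allowed, alphabet_size=20):
    """Bits gained by knowing a residue lies in `allowed` (uniform background assumed)."""
    return math.log2(alphabet_size / len(allowed))


def pattern_score(components, wildcards, c=0.5):
    """Sum the information of all component positions and subtract a flexibility penalty.

    components: list of components, each a list of allowed-residue strings, one per position
    wildcards:  list of (low, high) length bounds, one per wildcard region
    c:          penalty per unit of flexibility (an arbitrary illustrative value)
    """
    information = sum(position_information(pos) for comp in components for pos in comp)
    penalty = c * sum(high - low for low, high in wildcards)
    return information - penalty


# [RK]-x(2,3)-[DE]-x(2,3)-Y: three one-position components, two flexible wildcard regions.
components = [["RK"], ["DE"], ["Y"]]
wildcards = [(2, 3), (2, 3)]
print(round(pattern_score(components, wildcards), 3))
```

Under these assumptions a unique position scores higher than an ambiguous one, and a fixed wildcard region such as x(2) costs nothing, while each extra unit of allowed length lowers the score by c.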

Slide 11: Generalization/specialization

- Generalization of a pattern means weakening it
- Specialization means strengthening it
- If p' is a generalization of p, then all sequences that match p also match p'
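
The subset relation between a pattern and its generalization can be checked by brute force on a small case. In this sketch the two patterns, the three-letter alphabet, and the length limit are invented for illustration: p is [AS]-x(2)-A, and p' is obtained by widening the first component to [AST] and the wildcard region to x(2,3). Both are written directly as Python regular expressions.

```python
import itertools
import re


def matching_strings(regex, alphabet, max_len):
    """All strings over `alphabet` up to `max_len` symbols that match `regex` in full."""
    compiled = re.compile(regex)
    return {
        "".join(chars)
        for n in range(max_len + 1)
        for chars in itertools.product(alphabet, repeat=n)
        if compiled.fullmatch("".join(chars))
    }


p = "[AS].{2}A"             # [AS]-x(2)-A
p_general = "[AST].{2,3}A"  # [AST]-x(2,3)-A, a generalization of p

matches_p = matching_strings(p, "AST", 5)
matches_general = matching_strings(p_general, "AST", 5)

print(len(matches_p), len(matches_general))
print(matches_p <= matches_general)  # every string matching p also matches p'
```

The reverse inclusion does not hold: TAAA matches p' but not p, which is what makes p' strictly weaker.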

Slide 12: Pattern discovery