Physical Mapping II + Perl CIS 667 March 2, 2004.



Restriction Site Models
Let each fragment in the Double Digest Problem (DDP) be represented by its length. We assume there are no measurement errors and that all fragments are present. Digesting the target DNA with the first enzyme gives the multiset A = {a_1, a_2, …, a_n}. The second enzyme gives B = {b_1, b_2, …, b_n}. Digestion with both gives O = {o_1, o_2, …, o_n}.

Restriction Site Models
We want to find a permutation π_A of the elements of A and a permutation π_B of the elements of B. Plot the lengths from A on a line in the order of π_A. Plot the lengths from B on a line in the order of π_B, on top of the previous plot. Several new subintervals may be produced. We need a one-to-one correspondence between each resulting subinterval and each element of O.
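The check described above can be sketched in Perl. This is only an illustration, not the course's code: the subroutine names cut_sites and explains_double_digest and the sample fragment lengths are assumptions made for this example. It verifies one candidate pair of orderings; it does not search for them.

```perl
use strict;
use warnings;

# Cut positions obtained by laying the fragments end to end in the given order.
sub cut_sites {
    my @lengths = @_;
    my ($pos, @sites) = (0);
    for my $len (@lengths) {
        $pos += $len;
        push @sites, $pos;
    }
    return @sites;
}

# True if overlaying the two orderings yields exactly the multiset @$o.
sub explains_double_digest {
    my ($order_a, $order_b, $o) = @_;
    my %seen;
    # Merge both sets of cut sites, dropping duplicates (e.g. the right end).
    my @sites = sort { $a <=> $b }
                grep { !$seen{$_}++ }
                (cut_sites(@$order_a), cut_sites(@$order_b));
    my ($prev, @sub) = (0);
    for my $s (@sites) {
        push @sub, $s - $prev;    # length of the subinterval ending at $s
        $prev = $s;
    }
    my @got  = sort { $a <=> $b } @sub;
    my @want = sort { $a <=> $b } @$o;
    return "@got" eq "@want";
}

# Hypothetical data: A = {3, 6, 8, 10}, B = {4, 5, 7, 11}, O = {1, 2, 3, 3, 5, 6, 7}
my $ok  = explains_double_digest([3, 8, 6, 10], [4, 5, 11, 7], [1, 2, 3, 3, 5, 6, 7]);
my $msg = $ok ? "consistent with O" : "not consistent with O";
print "$msg\n";
```

Verifying a candidate is cheap; the difficulty of the problem lies in choosing π_A and π_B among all possible orderings.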

Restriction Site Models
This problem is NP-complete. It is a generalization of the set partition problem, and the number of solutions is exponential. The Partial Digest Problem has not been proven to be NP-complete, and its number of solutions is much smaller than for the DDP.

Interval Graph Models
We model hybridization mapping using interval graphs. The model is much simpler than the real problem, but still NP-complete. It uses graphs in which vertices represent clones and edges represent overlap information between clones.

First Interval Graph Model
The model uses two graphs. In G_r = (V, E_r), (i, j) ∈ E_r means we know that clones i and j overlap. In G_t = (V, E_t), E_t represents known and unknown overlap information. If we know for sure that two clones don't overlap, the corresponding edge is left out of the graph G_t.

First Interval Graph Model
Does there exist a graph G_s = (V, E_s) with E_r ⊆ E_s ⊆ E_t such that G_s is an interval graph? An interval graph G = (V, E) is an undirected graph obtained from a collection C of intervals on the real line. To each interval in C there corresponds a vertex in G. There is an edge between u and v if and only if their intervals have a non-empty intersection.
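As a small illustration of the definition, the following Perl sketch builds the edges of an interval graph from a collection of closed intervals. The subroutine name interval_graph_edges and the four sample intervals are assumptions made for this example.

```perl
use strict;
use warnings;

# Takes name => [left, right] pairs describing closed intervals and returns
# the edges of the corresponding interval graph as "u-v" strings.
sub interval_graph_edges {
    my (%interval) = @_;
    my @names = sort keys %interval;
    my @edges;
    for my $i (0 .. $#names) {
        for my $j ($i + 1 .. $#names) {
            my ($u, $v) = @names[$i, $j];
            # Two closed intervals intersect exactly when each one starts
            # no later than the other one ends.
            if (   $interval{$u}[0] <= $interval{$v}[1]
                && $interval{$v}[0] <= $interval{$u}[1]) {
                push @edges, "$u-$v";
            }
        }
    }
    return @edges;
}

# Hypothetical clones laid out along the target DNA.
my @edges = interval_graph_edges(
    a => [0, 4], b => [3, 7], c => [6, 9], d => [8, 12],
);
print "@edges\n";    # prints "a-b b-c c-d"
```

Going from intervals to the graph is easy, as shown here. The mapping problem asks the reverse question: whether some graph between G_r and G_t could have come from a set of intervals at all.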

First Interval Graph Model
[Figure: example on the elements a, b, c, d, e]

Non-Interval Graphs
[Figure: examples on the vertices a, b, c, d, e]

Second Interval Graph Model
Don't assume that the known overlap information is reliable. Construct a graph G = (V, E) using that information. Does there exist a graph G' = (V, E') such that E' ⊆ E, G' is an interval graph, and |E'| is maximum? The edges in E but not in E' are overlaps we discard as false positives. The solution is the interpretation that contains the minimum number of false positives.

Third Interval Graph Model
Use overlap information along with information about the source of each clone. Different clones come from different copies of the same molecule. Label each clone with the identification of the molecule copy it came from. Assume we had k copies of the target DNA and that different restriction enzymes were used to break up each copy.

Third Interval Graph Model
Build a graph G = (V, E) with the known overlap information between clones. Use k colors to color the vertices, one color per molecule copy. There are no edges between vertices of the same color, since those clones come from the same molecule copy and hence cannot overlap. We say that such a graph has a valid coloring. Does there exist a graph G' = (V, E') such that E ⊆ E', G' is an interval graph, and the coloring of G is valid for G'? That is, can we add edges to G, transforming it into an interval graph, without violating the coloring?
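The validity condition on a coloring is simple to test for a given edge set. A minimal Perl sketch, assuming a hypothetical subroutine coloring_is_valid, a hash from clone name to molecule copy, and edges given as pairs:

```perl
use strict;
use warnings;

# %$color maps each clone to the molecule copy (color) it came from.
# @$edges is a list of [u, v] pairs. The coloring is valid when no edge
# joins two clones of the same color.
sub coloring_is_valid {
    my ($color, $edges) = @_;
    for my $edge (@$edges) {
        my ($u, $v) = @$edge;
        return 0 if $color->{$u} eq $color->{$v};
    }
    return 1;
}

# Hypothetical data: clones c1, c2 from copy 1 and c3, c4 from copy 2.
my %color = (c1 => 1, c2 => 1, c3 => 2, c4 => 2);
my @known = (['c1', 'c3'], ['c2', 'c3'], ['c2', 'c4']);

my $before = coloring_is_valid(\%color, \@known);
# Adding an edge inside copy 1 would violate the coloring.
my $after  = coloring_is_valid(\%color, [@known, ['c1', 'c2']]);
print "known edges valid: $before, with c1-c2 added: $after\n";
```

Any edge proposed while completing G to an interval graph has to pass this test; the sketch does not address the harder question of which edges to add.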

Consecutive Ones Property
We can apply the previous models in any situation where we can obtain some type of fingerprint for each fragment. Now we use as a clone fingerprint the set of probes that hybridize to it. Assumptions: the reverse complement of each probe's sequence occurs only once in the target DNA ("probes are unique"); there are no errors; all "clones × probes" hybridization experiments have been done.

Consecutive Ones Property
If we have n clones and m probes, we build an n × m binary matrix M, where each entry M_ij tells us whether probe j hybridized to clone i or not. Obtaining a physical map from the matrix then becomes the problem of finding a permutation of the columns (probes) such that all 1s in each row (clone) are consecutive. Such a matrix is said to have the consecutive 1s property for rows (C1P).
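To make the property concrete, here is a Perl sketch that tests whether one given column order makes the 1s in every row consecutive. The subroutine name rows_consecutive and the 3 × 4 sample matrix are assumptions for illustration. It checks a single permutation and is not one of the polynomial-time C1P algorithms.

```perl
use strict;
use warnings;

# Returns 1 if, with the columns taken in the order @$order, the 1s of every
# row of @$matrix form one consecutive block; returns 0 otherwise.
sub rows_consecutive {
    my ($matrix, $order) = @_;
    for my $row (@$matrix) {
        my $bits = join '', map { $row->[$_] } @$order;
        # A 1, then one or more 0s, then another 1 means the 1s are split.
        return 0 if $bits =~ /10+1/;
    }
    return 1;
}

# Hypothetical data: rows are clones, columns are probes 0..3.
my @M = (
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
);
my $as_given  = rows_consecutive(\@M, [0, 1, 2, 3]);
my $reordered = rows_consecutive(\@M, [0, 2, 1, 3]);
print "original order: $as_given, order 0 2 1 3: $reordered\n";
```

For this matrix the probe order 0, 2, 1, 3 works, so that order is a candidate physical map of the probes.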

Consecutive Ones Property
There exist polynomial-time algorithms for testing the C1P, but they work only for data with no errors. Realistic algorithms should try to find matrices that approximate the C1P while minimizing the number of errors that must have been present to lead to such a solution. Two options are to allow 2 or 3 runs of 1s in a row, or to minimize the number of runs of 1s in the matrix. The problem is now NP-hard.
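The quantity to be minimized, the number of runs of 1s, can be computed for any given column order. A minimal Perl sketch, assuming a hypothetical subroutine count_runs and sample data:

```perl
use strict;
use warnings;

# Total number of maximal runs of 1s over all rows of @$matrix when the
# columns are taken in the order @$order.
sub count_runs {
    my ($matrix, $order) = @_;
    my $total = 0;
    for my $row (@$matrix) {
        my $bits = join '', map { $row->[$_] } @$order;
        my @runs = $bits =~ /1+/g;    # each match is one maximal run of 1s
        $total += scalar @runs;
    }
    return $total;
}

# Hypothetical clones x probes matrix.
my @M = (
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
);
my $runs_given     = count_runs(\@M, [0, 1, 2, 3]);
my $runs_reordered = count_runs(\@M, [0, 2, 1, 3]);
print "runs in given order: $runs_given, after reordering: $runs_reordered\n";
```

When every row contains at least one 1, a column order has the C1P exactly when the total equals the number of rows. With errors in the data that bound may be unreachable, and the goal becomes getting as close to it as possible.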

Now we will look at some Perl in preparation for assignment 1

Perl substitution operator
Example of the Perl substitution operator: $RNA =~ s/T/U/g;
Here $RNA is the variable, =~ is the binding operator, and s is the substitute operator. T is the PATTERN, a regular expression to be replaced by REPLACEMENT. / is the delimiter. U is the REPLACEMENT, the text that replaces PATTERN. g is a pattern modifier that means globally, throughout the string. Other modifiers: i (case insensitive), m (multiline), s (single line).
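A short runnable version of the slide's statement follows. The subroutine name dna2rna and the sample sequences are assumptions made for this sketch; only the substitution itself comes from the slide.

```perl
use strict;
use warnings;

# Transcribe DNA to RNA by replacing every T with U.
sub dna2rna {
    my ($rna) = @_;      # work on a copy of the argument
    $rna =~ s/T/U/g;     # g: replace all matches, not just the first
    return $rna;
}

my $RNA = dna2rna('ATGCTTAG');
print "$RNA\n";          # prints AUGCUUAG

# Without g only the first match is replaced; i ignores case.
my $once = 'ttaT';
$once =~ s/t/U/i;        # $once is now 'UtaT'
my $all = 'ttaT';
$all =~ s/t/U/gi;        # $all is now 'UUaU'
print "$once $all\n";
```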

Example 1
Let's use the substitution operator to calculate the reverse complement of a strand of DNA.
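The original program for this example is not part of the transcript, so what follows is only a sketch under stated assumptions. One pitfall is worth noting: chaining substitutions such as s/A/T/g followed by s/T/A/g does not work, because the second substitution also rewrites the Ts that the first one just produced. The sketch below therefore swaps in the tr/// operator, which maps all characters in a single pass. The subroutine name revcom and the sample sequence are hypothetical.

```perl
use strict;
use warnings;

# Reverse complement of a DNA strand.
sub revcom {
    my ($dna) = @_;
    my $revcom = reverse $dna;            # scalar context: reverses the string
    $revcom =~ tr/ACGTacgt/TGCAtgca/;     # complement every base in one pass
    return $revcom;
}

my $result = revcom('ACGGGAGT');
print "$result\n";                        # prints ACTCCCGT

# The naive chain of substitutions gives a wrong complement:
my $naive = 'AATT';
$naive =~ s/A/T/g;                        # 'TTTT'
$naive =~ s/T/A/g;                        # 'AAAA', not the complement 'TTAA'
print "$naive\n";
```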

Example 2
One common task in bioinformatics is to look for motifs, short segments of DNA or protein of interest, for example regulatory elements of DNA. Let's see a program to read in protein sequence data from a file, put all the sequence data into one string for easy searching, and look for motifs the user types in at the keyboard.

Turning Arrays into Scalars
We often find sequence data broken into short segments of 80 or so characters. This is inconvenient for the Perl program, which would have to deal with motifs that span more than one line. Collapse an array into a scalar with join: $protein = join(
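The join call is cut off in the transcript. The following is a minimal sketch of the idea, not a reconstruction of the slide: the subroutine name collapse_lines, the empty-string separator, and the sample lines are all assumptions.

```perl
use strict;
use warnings;

# Concatenate a list of lines into one string, with nothing between them.
sub collapse_lines {
    my @lines = @_;
    return join('', @lines);
}

# Hypothetical sequence data as it might be read line by line from a file.
my @protein = ("MNIDDKLEGL\n", "FLKCGGIDEM\n", "QSSRAMVVMG\n");
my $protein = collapse_lines(@protein);
print length($protein), "\n";    # prints 33: 30 residues plus 3 newlines
```

The newlines are still in the string after the join, which is why a whitespace-removing substitution is needed next.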

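Regular expressions
Regular expressions are ways of matching one or more strings using special wildcard-like operators. The statement $protein =~ s/\s//g removes all whitespace: \s matches whitespace and can also be written [ \t\n\f\r]. In the test if ($motif =~ /^\s*$/ ) {, ^ matches the beginning of the line, $ matches the end of the line, and * means repeated zero or more times, so the test succeeds when $motif is empty or contains only whitespace.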
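The two patterns can be tried out in isolation. In this sketch the subroutine names strip_whitespace, is_blank and has_motif are hypothetical, and the motif is assumed to be used directly as a regular expression.

```perl
use strict;
use warnings;

# Remove every whitespace character (spaces, tabs, newlines, ...).
sub strip_whitespace {
    my ($text) = @_;
    $text =~ s/\s//g;
    return $text;
}

# True when the string is empty or holds only whitespace.
sub is_blank {
    my ($text) = @_;
    return $text =~ /^\s*$/ ? 1 : 0;
}

# True when the motif, treated as a regular expression, occurs in the sequence.
sub has_motif {
    my ($sequence, $motif) = @_;
    return $sequence =~ /$motif/ ? 1 : 0;
}

my $protein = strip_whitespace("MNID\nDKLE GL\n");
print "$protein\n";                                # prints MNIDDKLEGL
print is_blank("  \n"), is_blank("KLE"), "\n";     # prints 10
print has_motif($protein, 'D+K'), "\n";            # prints 1
```

A blank motif is a natural signal for a motif-search loop to stop asking the user for input.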

Hashes
There are three main data types in Perl: scalar variables, arrays and hashes (also called associative arrays). A hash provides a fast lookup of the value associated with a key. A hash is initialized like this:
%classification = (
    'dog'   => 'mammal',
    'robin' => 'bird',
    'asp'   => 'reptile',
);
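A lookup on the hash above might look like the following sketch. The subroutine name classify and the 'unknown' fallback are assumptions added for illustration.

```perl
use strict;
use warnings;

my %classification = (
    'dog'   => 'mammal',
    'robin' => 'bird',
    'asp'   => 'reptile',
);

# Look up the class of an animal, with a fallback for keys not in the hash.
sub classify {
    my ($animal) = @_;
    return exists $classification{$animal}
        ? $classification{$animal}
        : 'unknown';
}

my $known   = classify('robin');
my $missing = classify('trout');
print "$known $missing\n";    # prints "bird unknown"
```

Note the change of sigil: the whole hash is %classification, while a single element is written $classification{'robin'} because it is a scalar value.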

Example 3
Let's look at a subroutine, codon2aa, that uses a hash lookup to translate a codon to an amino acid.

Example 3
The arguments to the subroutine are in the array @_. Declare a local variable as a my variable: my($dna)
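The subroutine itself is not included in the transcript. A minimal sketch of what a hash-based codon2aa could look like follows; the body is an assumption, the variable is named $codon here for readability, and the table is deliberately partial (a full table has 64 entries).

```perl
use strict;
use warnings;

# Translate one DNA codon to a one-letter amino acid code by hash lookup.
# Only a few entries of the genetic code are listed in this sketch.
sub codon2aa {
    my ($codon) = @_;        # arguments arrive in @_; copy into a my variable
    $codon = uc $codon;      # accept lowercase input

    my %genetic_code = (
        'ATG' => 'M',        # Methionine (start)
        'TGG' => 'W',        # Tryptophan
        'TTT' => 'F',        # Phenylalanine
        'TTC' => 'F',        # Phenylalanine
        'GGA' => 'G',        # Glycine
        'TAA' => '_',        # Stop
        'TAG' => '_',        # Stop
        'TGA' => '_',        # Stop
    );

    if (exists $genetic_code{$codon}) {
        return $genetic_code{$codon};
    }
    die "Bad codon \"$codon\"\n";
}

my $aa = codon2aa('atg');
print "$aa\n";               # prints M
```

Using my keeps $codon private to the subroutine, so a variable of the same name in the calling program is not overwritten.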

Example 4
We can use that subroutine to translate DNA into protein. Note the use of a module (library). Note the use of .= to concatenate.
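The translation loop can be sketched as follows. In the slide's program codon2aa is loaded from a module; to keep this example self-contained, a small stand-in with a partial codon table is defined in the same file. The stand-in, the subroutine name translate, and the sample DNA are all assumptions.

```perl
use strict;
use warnings;

# Stand-in for the module's codon2aa, with a deliberately partial table.
my %genetic_code = (
    'ATG' => 'M', 'TTT' => 'F', 'GGA' => 'G', 'TGG' => 'W', 'TAA' => '_',
);

sub codon2aa {
    my ($codon) = @_;
    $codon = uc $codon;
    return $genetic_code{$codon} if exists $genetic_code{$codon};
    die "Bad codon \"$codon\"\n";
}

# Translate DNA into protein, three bases at a time.
sub translate {
    my ($dna) = @_;
    my $protein = '';
    # Stop before any trailing partial codon of one or two bases.
    for (my $i = 0; $i < length($dna) - 2; $i += 3) {
        my $codon = substr($dna, $i, 3);
        $protein .= codon2aa($codon);    # .= appends to the string
    }
    return $protein;
}

my $protein = translate('ATGTTTGGATGGTAA');
print "$protein\n";                      # prints MFGW_
```

The statement $protein .= codon2aa($codon) is shorthand for $protein = $protein . codon2aa($codon).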