Lecture 8: Perl pattern matching features


Bioinformatics Lecture 8: Perl pattern matching features

Questions to think about
- Create a hash table that performs the codon to amino acid conversion, and use it to convert codons entered from the keyboard into their corresponding amino acids.
- Write a script that extracts the gene ID and gene name from the descriptor header of a DNA FASTA file.

Questions to think about
- Write a script that reads in the DNA sequences from two FASTA files (assume the sequence length is the same for both) and determines the number of alignment matches to non-matches.

Introduction
- Pattern matching
- Pattern extraction
- Pattern substitution
- Split and join functions
- Unpack function

Pattern Matching
Recall that =~ is the pattern matching operator.
A first simple match example:
    print "EcoRI site found!" if $dna =~ /gaattc/;
This means: if the string $dna contains the pattern gaattc (the EcoRI recognition site), print "EcoRI site found!". The pattern is whatever sits between the two / characters, and =~ binds that pattern to the string.
More patterns:
    if ($dna =~ /[GATCgatc]/)        # contains at least one base, in either case
    if (/^[GATC]/i)                  # $_ starts with a base; /i ignores case
    if ($dna =~ /GAATTC|AAGCTT/) {   # | is the "or" symbol: EcoRI site or HindIII site
        print "EcoRI or HindIII site found!";
    }
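The match operator and the alternation symbol can be tried out with a short script. This is only a sketch: the sub name find_sites and the test sequence are made up for illustration, and it assumes GAATTC is the EcoRI site and AAGCTT the HindIII site.

```perl
use strict;
use warnings;

# Return the names of the restriction sites found in a DNA string.
# find_sites is a hypothetical helper, not part of the lecture scripts.
sub find_sites {
    my ($dna) = @_;
    my @found;
    push @found, 'EcoRI'   if $dna =~ /GAATTC/i;   # /i ignores case
    push @found, 'HindIII' if $dna =~ /AAGCTT/i;
    return @found;
}

my $dna = 'ccgaattcgg';   # made-up test sequence
print "Sites found: @{[ find_sites($dna) ]}\n";

# The same test written as a single pattern using | (or)
print "At least one site found\n" if $dna =~ /GAATTC|AAGCTT/i;
```

Testing each site with its own pattern tells you which enzyme matched; the single alternation pattern only tells you that one of them did.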

Pattern Matching
A more flexible pattern:
    print "Match found!" if $dna =~ /GAA[GATC]TTC/;
Here the 4th letter can be any one of the letters within the square brackets.
Character classes:
    [GATC]                   any one of G, A, T or C
    [^GATC]                  any character other than G, A, T or C
    [0-9] or \d              a digit
    [a-z]                    a lowercase letter
    [A-Z]                    an uppercase letter
    /[AT][GC][TG]/           A or T, then G or C, then T or G
    /[a-zA-Z0-9_]/ or /\w/   a word character
    /\s/                     white space; to invert it, uppercase the letter: \S (non-white space)
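A minimal sketch of character classes in use, including the negated form [^...]. The sub names and the test strings are invented for this example.

```perl
use strict;
use warnings;

# Count the characters in a sequence that are NOT one of G, A, T or C.
sub count_non_bases {
    my ($seq) = @_;
    # In list context /g returns every match; "= () =" counts them.
    my $count = () = $seq =~ /[^GATCgatc]/g;
    return $count;
}

# Return 1 if the string contains GAA, then any single base, then TTC.
sub has_flexible_site {
    my ($seq) = @_;
    return $seq =~ /GAA[GATC]TTC/ ? 1 : 0;
}

print "Non-base characters: ", count_non_bases('GATNNC'), "\n";
print "Flexible site present\n" if has_flexible_site('ccGAAGTTCcc');
```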

Pattern matching: metacharacters
    Metacharacter   Description
    .               Any character except newline
    \.              Full stop character
    ^               The beginning of a line
    $               The end of a line
    \w              Any word character (letter, digit or underscore)
    \W              Any non-word character
    \s              White space (spaces, tabs, carriage returns)
    \S              Non-white space
    \d              Any digit
    \D              Any non-digit
You can also specify how many times a character must occur: once, many times, or a specific number of times.

Pattern matching: Quantifiers
    Quantifier   Description
    ?            0 or 1 occurrence
    +            1 or more occurrences
    *            0 or more occurrences
    {N}          Exactly N occurrences
    {N,M}        Between N and M occurrences
    {N,}         At least N occurrences
    {0,M}        No more than M occurrences (the short form {,M} is only accepted from Perl 5.34 onwards)

Pattern matching: Quantifiers
Match the following format: M58200.2
    =~ /\w+\.\d+/
If the sequence is Pu-C-X(40-80)-Pu-C, where Pu is [AG] and X is [ATGC]:
    $sequence =~ /[AG]C[GATC]{40,80}[AG]C/;
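Both quantifier examples can be checked with a few lines of Perl. This is a sketch with invented sub names and synthetic test sequences; the accession check additionally anchors the pattern with ^ and $ so that the whole string has to fit the format.

```perl
use strict;
use warnings;

# Return 1 if the whole string looks like an accession with a version, e.g. M58200.2
sub looks_like_accession {
    my ($id) = @_;
    return $id =~ /^\w+\.\d+$/ ? 1 : 0;
}

# Return 1 if the sequence contains Pu-C-X(40-80)-Pu-C.
sub has_motif {
    my ($seq) = @_;
    return $seq =~ /[AG]C[GATC]{40,80}[AG]C/ ? 1 : 0;
}

my $hit  = 'AC' . ('T' x 40) . 'GC';   # spacer of exactly 40 bases
my $miss = 'AC' . ('T' x 39) . 'GC';   # spacer one base too short

print "Accession format ok\n" if looks_like_accession('M58200.2');
print "Motif found in hit\n"  if has_motif($hit);
print "No motif in miss\n"    unless has_motif($miss);
```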

Anchors
E.g. matching a word exactly: /\bword\b/
    \b is a word boundary: the pattern matches only the whole word "word", not the letters w, o, r and d inside a longer word.
The start-of-line anchor ^:
    /^>/ matches only lines beginning with >
The end-of-line anchor $:
    />$/ matches only lines whose last character is >
    /^$/ : what does this mean?
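The anchors above can be explored with a small classifier. This is an illustrative sketch: classify_line and has_whole_word are made-up names, and treating a line that matches /^$/ as "blank" is this example's reading of that pattern after the newline has been removed.

```perl
use strict;
use warnings;

# Label a line as a FASTA header, an empty line, or anything else.
sub classify_line {
    my ($line) = @_;
    chomp $line;
    return 'header' if $line =~ /^>/;   # first character is >
    return 'blank'  if $line =~ /^$/;   # start is immediately followed by end
    return 'other';
}

# Return 1 only if "word" appears as a whole word.
sub has_whole_word {
    my ($text) = @_;
    return $text =~ /\bword\b/ ? 1 : 0;
}

print classify_line(">2L52.1 CE20433\n"), "\n";
print classify_line("\n"), "\n";
print "whole word\n"     if has_whole_word('a word here');
print "not whole word\n" unless has_whole_word('password');
```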

Further examples
file_size_bases_only.pl example:
    #!/usr/bin/perl
    # file size2.pl
    $length = 0;
    $lines = 0;
    while (<>) {
        chomp;
        # Add the line length if the line contains a base letter.
        # Note: this also counts a header line that contains g, a, t or c.
        $length = $length + length $_ if $_ =~ /[GATCgatc]/;
        # Alternative that skips the header: $length += length if /^[GATCN]+$/i;
        $lines = $lines + 1;
    }
    print "LENGTH = $length\n";
    print "LINES = $lines\n";

FASTA files
Write and test file_size_bases_only.pl using a FASTA file as input.
FASTADNA1.txt: example of a FASTA file
    >2L52.1 CE20433 Zinc finger, C2H2 type (CAMBRIDGE) protein id:CAA21776.1
    GCAGCGCACGACAGCTGTGCTATCCCGGCGAGCCCGTGGCAGAGGACCTCGCTTGCGAAAGCATCGAGTACC
Sample of a file in EMBL format:
    gccacagatt acaggaagtc atatttttag acctaaatca ctatcctcta tctttcagca 60
    agaaaagaac atctacttgg tttcgttccc tatccaagat tcagatggtg aaacgagtga 120
    tcatgcacct gatgaacgtg caaaaccaca gtcaagccat gacaaccccg atctacagtt 180
    tgatgttgaa actgccgatt ggtacgccta cagtgaaaac tatggcacaa gtgaagaaaa 240
Sample of an NCBI record format:
    1 atgaacccca acctgtgggt cgacgcgcag agcacttgca agagggaatg cgacgctgac
    61 ctggagtgcg agacctttga gaagtgctgc cccaatgtct gtggaaccaa gagctgtgtg
    121 gctgctcggt acatggacat caaggggaag aaggggcctg tggggatgcc caaagaggca
    181 acctgtgacc gcttcatgtg catccagcaa ggctcagagt gcgacatctg ggacgggcag
    241 cctgtctgca agtgcaagga caggtgtgag aaggagccga gctttacctg cgcctcggac

Extracting Patterns
Consider a sequence like:
    >M185580 clone 333a, complete sequence
- >M18… is the sequence ID
- Clone 33a, com…. : optional comments
We need to store some of the elements of the descriptor line:
    =~ /(\S+)/
The part of the match inside the parentheses is extracted and put into the variable $1.
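As a sketch of how capturing parentheses pull fields out of a descriptor line: parse_header is a hypothetical helper, and the pattern assumes the ID is the first run of non-space characters after the > with everything after it being the comment.

```perl
use strict;
use warnings;

# Split a FASTA descriptor line into (ID, comment).
# Returns an empty list if the line is not a descriptor line.
sub parse_header {
    my ($line) = @_;
    chomp $line;
    if ($line =~ /^>(\S+)\s*(.*)$/) {
        return ($1, $2);   # $1 = first parentheses, $2 = second
    }
    return;
}

my ($id, $comment) = parse_header(">M185580 clone 333a, complete sequence\n");
print "ID: $id\n";
print "Comment: $comment\n";
```

Returning the captures straight away matters, because $1 and $2 are overwritten by the next successful match.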

Extracting patterns
    #!/usr/bin/perl -w
    # demonstrates the effect of parentheses.
    while ( my $line = <> ) {
        $line =~ /\w+ (\w+) \w+ (\w+)/;
        print "Second word: '$1' on line $..\n" if defined $1;
        print "Fourth word: '$2' on line $..\n" if defined $2;
    }
Exercise: change it to catch the first and the third word of a sentence.

Search and replace
    s/t/u/      replace thymine (t) with uracil (u); once only
    s/t/u/g     g = global, so scan the whole string
    s/t/u/gi    global and case insensitive
What about the following?
    s/^\s+//
    s/\s+$//
    s/\s+$/ /g  (where g stands for global)
Exercise: write a Perl script that reads in the DNA sequences from FastaDNA1file.txt and replaces all the thymine bases with the corresponding uracil bases.
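A short sketch of the substitution operator on copies of a string. The sub names are invented; transcribe uses two case-sensitive global substitutions so that upper and lower case are preserved, which is a choice made for this example, not something the slide prescribes.

```perl
use strict;
use warnings;

# Replace every t with u and every T with U, keeping the case.
sub transcribe {
    my ($dna) = @_;
    $dna =~ s/t/u/g;
    $dna =~ s/T/U/g;
    return $dna;
}

# Remove leading and trailing white space.
sub trim {
    my ($text) = @_;
    $text =~ s/^\s+//;   # white space at the start
    $text =~ s/\s+$//;   # white space at the end
    return $text;
}

my $once = 'gattaca';
$once =~ s/t/u/;         # no /g: only the first t is replaced
print "$once\n";
print transcribe('gattaca'), "\n";
print '[', trim("  acgt \n"), "]\n";
```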

Splits and joins
To transform strings into arrays: split
Line 1 looks like:
    192a8,The Sanger Centre,GGGTTCCGATTTCCAA,CCTTAGGCCAAATTAAGGCC
Consider the following code:
    chomp($line = <>);   # read the line into $line
    @fields = split ',', $line;
    ($clone, $laboratory, $left_oligo, $right_oligo) = split ',', $line;
This reads in line 1 and splits it at each delimiter, so that each part (e.g. 192a8) becomes one element of the array.
To transform arrays (lists) into strings: join
    $tab = join "\t", @fields;
    192a8   The Sanger Centre   GGGTTCCGATTTCCAA   CCTTAGGCCAAATTAAGGCC
    # initialize an array
    my @perlFunc = ("substr", "grep", "defined", "undef");
    my $perlFunc = join " ", @perlFunc;
    print "Perl Functions: $perlFunc\n";
See example split_file.pl
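Split and join can be combined into a round trip from comma-separated to tab-separated text. A minimal sketch, assuming four fields per line; the sub names and hash keys are made up, and splitting on /\s*,\s*/ rather than ',' is an added convenience that also drops stray spaces around the commas.

```perl
use strict;
use warnings;

# Turn one comma-separated line into a tab-separated string.
sub csv_to_tsv {
    my ($line) = @_;
    chomp $line;
    my @fields = split /\s*,\s*/, $line;
    return join "\t", @fields;
}

# Store the four fields of a clone line in a hash, keyed by name.
sub parse_clone_line {
    my ($line) = @_;
    chomp $line;
    my %record;
    @record{qw(clone laboratory left_oligo right_oligo)} = split /\s*,\s*/, $line;
    return \%record;
}

my $line = "192a8,The Sanger Centre,GGGTTCCGATTTCCAA,CCTTAGGCCAAATTAAGGCC\n";
print csv_to_tsv($line), "\n";
my $record = parse_clone_line($line);
print "Laboratory: $record->{laboratory}\n";
```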

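Other useful functions
Unpack syntax:
    @triplets = unpack("a3" x (length($line)/3), $line);
Frame shift (1 position to the right):
    @triplets = unpack('a' . "a3" x (length($line)/3), $line);
Here the leading 'a' takes the first base as an element of its own, so the triplets start at the second base.
See Unpack_codons.pl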
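One way to wrap the unpack idea in a reusable sub is sketched below. The name codons and its offset argument are assumptions for this example; unlike the frame-shift line above, it uses the x template code to skip the leading bases, so the returned list contains only complete triplets.

```perl
use strict;
use warnings;

# Split a sequence into complete triplets, starting $offset bases in.
sub codons {
    my ($seq, $offset) = @_;
    $offset ||= 0;
    my $n = int((length($seq) - $offset) / 3);   # number of whole triplets
    return () if $n < 1;
    my $skip = $offset ? "x$offset" : '';         # x skips bytes in unpack
    return unpack($skip . ('a3' x $n), $seq);
}

my $seq = 'atggcctaa';   # made-up sequence
print join(' ', codons($seq)), "\n";      # reading frame 1
print join(' ', codons($seq, 1)), "\n";   # shifted one base to the right
```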

Questions
- Modify file_size_bases_only.pl to count the number of bases for a file in EMBL format and for one in NCBI format.
- Using FASTADNA1.txt, extract the sections of the descriptor line into appropriate scalar variables.
- Assuming the DNA sequence of FastaDNA1file.txt is the complementary (anti-sense) strand, print the mRNA produced when the primary strand (sequence) is transcribed.

Exam Questions
- Perl is an important bioinformatics language. Explain the main features of Perl that make it appealing to the field of bioinformatics.
- Write a script that extracts the gene ID and gene name from the descriptor header of a DNA FASTA file.
- Write a Perl script that only reads and prints DNA sequences from a FASTA file.
- Write a script that reads in the DNA sequences from two FASTA files (assume the sequence length is the same for both) and determines the number of alignment matches to non-matches.

FastaDNA1file.txt
Write a script that reads in the DNA sequences from two FASTA files (assume the sequence length is the same for both) and illustrates the number of alignment matches to non-matches.