Lecture 7: Perl pattern handling features



Pattern Matching
Recall that =~ is the pattern matching operator. A first simple match example:
– print "A methionine amino acid is found " if $AA =~ /m/;
– This means: if the string $AA contains the letter m, print that a methionine amino acid was found.
– What is inside the / / is the pattern, and =~ is the pattern matching operator.
– It could also be written as: if ($AA =~ /m/) { print "A methionine amino acid is found "; }
– See Met.pl

Pattern Matching
– To check for the start codon we could use: if ($seq =~ /ATG/) { print "a start codon was found on line number\n" }
– Or we could write /ATG/i (where i stands for case-insensitive), so that atg also matches.
– To see if there is an A or T or G or C in the sequence, use a character class: $seq =~ /[ATGC]/
– The main way to express a Boolean OR between whole patterns is the | symbol: if ($dna =~ /GAATTC|AAGCTT/) { print "EcoRI site found!!!"; }
– (Note: GAATTC is the recognition site of the restriction enzyme EcoRI, an important DNA sequence. AAGCTT is the HindIII site, so strictly this pattern reports either site.)
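The three kinds of match on this slide (case-insensitive, character class, alternation) can be tried together in one short script. This is only a minimal sketch: the sequence in $seq is a made-up example, not course data.

```perl
use strict;
use warnings;

# Hypothetical test sequence, chosen to contain one EcoRI site.
my $seq = "ccgatgGAATTCttag";

# /i makes the match case-insensitive, so "atg" satisfies /ATG/i.
my $has_start = ($seq =~ /ATG/i) ? 1 : 0;

# A character class matches any single one of the listed characters.
my $has_base = ($seq =~ /[ATGC]/) ? 1 : 0;

# | is alternation: either GAATTC (EcoRI) or AAGCTT (HindIII).
my $has_site = ($seq =~ /GAATTC|AAGCTT/) ? 1 : 0;

print "start codon: $has_start\n";
print "any base:    $has_base\n";
print "site found:  $has_site\n";
```

Removing the i flag from the first pattern makes the start codon test fail for this sequence, because the only atg in it is lower case.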

Sequence size example
File_size_2 example:
  #!/usr/bin/perl
  # file size2.pl
  $length = 0;
  $lines = 0;
  while (<>) {
      chomp;
      # n refers to any nucleotide
      $length = $length + length $_ if $_ =~ /[GATCNgatcn]/;
      $lines = $lines + 1;
  }
  print "LENGTH = $length\n";
  print "LINES = $lines\n";
The above is a modification of the file length example so that only input lines containing G, A, T, C or N (in either case) are added to the length. However, this will lead to problems for FASTA files, as the descriptor line will be included: why?

Pattern Matching
A Boolean NOT, for example to see whether a string contains characters that are not vowels, is expressed in a pattern by putting the ^ symbol first inside a set of characters, e.g.
– if ($seq =~ /[^aeiou]/) { print "contains a non-vowel character"; }
More flexible pattern syntax: it is quite common to check for words or numbers, so Perl provides shorthand classes:
– /[0-9]/ or /\d/ is a digit
– A word character is /[a-zA-Z0-9_]/ and is represented by /\w/ (word)
– /\s/ represents a white space character
– Inverting the case of the letter reverses the meaning, e.g. /\S/ (non white space)
A more complete list of what are referred to as "metacharacters" is shown in the next slide (you must of course use =~ in the expression).
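A quick way to get a feel for these classes is to test a few strings against each of them. The sketch below assumes a helper called classify, which is a hypothetical name introduced here for illustration, and uses made-up sample strings.

```perl
use strict;
use warnings;

# classify() is a hypothetical helper: it lists which character
# classes occur at least once in the string it is given.
sub classify {
    my ($s) = @_;
    my @found;
    push @found, "non-vowel"  if $s =~ /[^aeiou]/;  # any character other than a, e, i, o, u
    push @found, "digit"      if $s =~ /\d/;
    push @found, "word"       if $s =~ /\w/;
    push @found, "whitespace" if $s =~ /\s/;
    push @found, "non-space"  if $s =~ /\S/;
    return @found;
}

# Made-up sample strings.
for my $s ("aeiou", "DT249 4", "   ") {
    print "'$s': ", join(", ", classify($s)), "\n";
}
```

Note that a string of spaces counts as containing a non-vowel: /[^aeiou]/ asks whether any character is outside the set, not whether the string has no vowels.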

Pattern matching: metacharacters
Metacharacter   Description
.               Any character except newline
\.              Full stop character
^               The beginning of a line
$               The end of a line
\w              Any word character (letter, digit or underscore; not punctuation or white space)
\W              Any non-word character
\s              White space (spaces, tabs, carriage returns)
\S              Non-white space
\d              Any digit
\D              Any non-digit
You can also specify the number of times a character must occur (once, several times, or a specific number of times).
More information is given in the linked notes on metacharacters and other regular expressions. Note that grouping with (abc) and the backreferences \1 and \2 are important for comparing sets of characters.
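The grouping and backreference note can be illustrated with a small sketch. The sequence below is made up, and the pattern is one possible illustration rather than an example from the course scripts.

```perl
use strict;
use warnings;

# Made-up sequence containing the pair "at" twice in a row.
my $seq = "gatatc";

# (\w\w) captures two word characters; \1 must then match the
# same two characters again immediately afterwards.
my $repeat = ($seq =~ /(\w\w)\1/) ? $1 : "";

print "Repeated pair: $repeat\n" if $repeat ne "";
```

Inside the pattern the captured text is referred to as \1; after a successful match the same text is available to the program as $1.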

Pattern matching: Quantifiers
Quantifier   Description
?            0 or 1 occurrence
+            1 or more occurrences
*            0 or more occurrences
{N}          Exactly N occurrences
{N,M}        Between N and M occurrences
{N,}         At least N occurrences
{0,M}        No more than M occurrences (the short form {,M} is only treated as a quantifier from Perl 5.34 onwards)

Pattern matching: Quantifiers
Consider the following pattern:
– DT249 4 (your class code) consists of one or more word characters, then a space, and then a digit, so the match is: =~ /\w+\s\d/
If the sequence has the following format:
– Pu-C-X(40-80)-Pu-C, where Pu (a purine) is [AG] and X is any base [ATGC]
– $sequence =~ /[AG]C[GATC]{40,80}[AG]C/;
See Quantify.pl
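To see the {40,80} quantifier at work, the pattern can be tested against one sequence that is long enough and one that is not. This is a sketch with made-up sequences; it is not the content of Quantify.pl.

```perl
use strict;
use warnings;

# The purine-C, 40 to 80 bases, purine-C pattern from the slide.
my $motif = qr/[AG]C[GATC]{40,80}[AG]C/;

# Made-up sequences: one with 50 bases in the middle, one with 10.
my $long  = "AC" . ("T" x 50) . "GC";
my $short = "AC" . ("T" x 10) . "GC";

my $long_ok  = ($long  =~ $motif) ? 1 : 0;
my $short_ok = ($short =~ $motif) ? 1 : 0;

# The class code pattern: word characters, one space, one digit.
my $code_ok = ("DT249 4" =~ /\w+\s\d/) ? 1 : 0;

print "long: $long_ok, short: $short_ok, class code: $code_ok\n";
```

The short sequence fails because the whole pattern needs at least 44 characters: two for the first Pu-C, at least 40 in the middle, and two for the final Pu-C.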

Pattern Matching
To determine where to look for a "pattern" in a sequence, use anchors:
– The start of line anchor ^ (note: the same symbol means NOT only when it comes first inside a character class, as in [^aeiou]). /^>/ matches only lines beginning with >
– The end of line anchor $. />$/ matches only lines whose last character is >
– /^$/ : what does this mean?
– The boundary anchor \b. E.g. to match a word exactly use /\bword\b/, where \b is a word boundary: this finds "word" only as a whole word, not when the letters w, o, r, d are part of a longer word
– The non-boundary anchor is \B. /\Bword\B/ finds word only inside a longer word, such as swordfish or passwords, but not word on its own or wording
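The anchors can be compared side by side by running the same lines through each pattern. The input lines below are made up for illustration, and the hash name %hits is an arbitrary choice.

```perl
use strict;
use warnings;

# Made-up input lines.
my @lines = (">seq1 example", "GATTACA", "a word here", "swordfish");

my %hits;
for my $line (@lines) {
    push @{ $hits{header} }, $line if $line =~ /^>/;        # starts with >
    push @{ $hits{whole}  }, $line if $line =~ /\bword\b/;  # "word" as a whole word
    push @{ $hits{inside} }, $line if $line =~ /\Bword\B/;  # "word" inside a longer word
}

for my $kind (sort keys %hits) {
    print "$kind: @{ $hits{$kind} }\n";
}
```

Each line lands in at most one group here: \b and \B are opposites at any given position, so the same occurrence of word cannot satisfy both patterns.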

Sequence Size example: modified
File_size_2 example:
  #!/usr/bin/perl
  # file size2.pl
  $length = 0;
  $lines = 0;
  while (<>) {
      chomp;
      $length = $length + length $_ if $_ =~ /^[GATCNgatcn]+$/;
      # Alternative: $length += length if /^[GATCN]+$/i;
      $lines = $lines + 1;
  }
  print "LENGTH = $length\n";
  print "LINES = $lines\n";
Refer to the DNA sequence codes to see the meaning of A…N.

Extracting Patterns
The second aspect of Perl pattern handling is pattern extraction. Consider a descriptor line like:
> M185580, clone 333a, complete sequence
– M18… is the sequence ID
– Clone 33a, com…. : optional comments
We need to store some of the elements of the descriptor line:
– $seq =~ /(\S+)/; the part of the match inside the parentheses is extracted and put into the variable $1
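As a sketch of extraction with more than one pair of parentheses, the descriptor line above can be split into an ID and a comment string. The pattern assumes the layout shown (a >, optional spaces, an ID ending at the first comma, then free text); it is an illustration, not the pattern used in the course scripts, and the variable names are arbitrary.

```perl
use strict;
use warnings;

# The descriptor line from the slide.
my $header = "> M185580, clone 333a, complete sequence";

# Assumed layout: ">", optional spaces, an ID ending at the first
# comma, then free-text comments.
my ($id, $comments);
if ($header =~ /^>\s*([^,\s]+),?\s*(.*)$/) {
    ($id, $comments) = ($1, $2);   # copy the captures before the next match overwrites them
}

print "ID: $id\n"             if defined $id;
print "Comments: $comments\n" if defined $comments;
```

Copying $1 and $2 into named variables straight away is a common habit, since a later successful match replaces their contents.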

Extracting patterns
  #!/usr/bin/perl -w
  # demonstrates the effect of parentheses.
  while ( my $line = <> ) {
      $line =~ /\w+ (\w+) \w+ (\w+)/;
      print "Second word: '$1' on line $..\n" if defined $1;
      print "Fourth word: '$2' on line $..\n" if defined $2;
  }
– Change it to catch the first and the third word of a sentence.
More examples in ExtractExample1.pl

Search/replace and transliteration
– s/t/u/ replaces t (thymine) with u (uracil), once only
– s/t/u/g (g = global) scans the whole string and replaces every t
– s/t/u/gi (global and case insensitive)
What about the following?
– s/^\s+//
– s/\s+$//
– s/\s+/ /g (where g stands for global)
The transliteration function:
– $seq =~ tr/ATGC/TACG/; gets the complement of a string of bases. (The normal search and replace works in a different way to the tr function: tr maps single characters one-for-one and does not use regular expressions.)
Refer to SearchReplace.pl
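The difference between the substitution modifiers, and between s/// and tr, is easiest to see on one small string. The sketch below uses a made-up mixed-case sequence and is not the content of SearchReplace.pl.

```perl
use strict;
use warnings;

# Made-up mixed-case sequence.
my $dna = "ATGttTag";

# Copy first, then substitute, so $dna itself is left unchanged.
(my $once   = $dna) =~ s/t/u/;    # first lowercase t only
(my $all    = $dna) =~ s/t/u/g;   # every lowercase t
(my $all_ci = $dna) =~ s/t/u/gi;  # every t or T

# tr maps characters one-for-one: A->T, T->A, G->C, C->G.
(my $comp = "ATGC") =~ tr/ATGC/TACG/;

print "$once\n$all\n$all_ci\n$comp\n";
```

With the i modifier the replacement text is still the lowercase u, so upper case T also becomes u rather than U.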

Search/replace/extract
Write a program that removes the > from the FASTA descriptor line and assigns each element to an appropriate variable. Example Fastafile_replace.txt:
>gi|171361, Saccharomyces cerevisiae, cystathionine gamma-lyase
GCAGCGCACGACAGCTGTGCTATCCCGGCGAGCCCGTGGCAGAGGACCTCGCTTGCGAAAGCATCGAGTACC
GCTACAGAGCCAACCCGGTGGACAAACTCGAAGTCATTGTGGACCGAATGAGGCTCAATAACGAGATTAGCG
ACCTCGAAGGCCTGCGCAAATATTTCCACTCCTTCCCGGGTGCTCCTGAGTTGAACCCGCTTAGAGACTCCG
AAATCAACGACGACTTCCACCAGTGGGCCCAGTGTGACCGCCACACTGGACCCCATACCACTTCTTTTTGTT
ATTCTTAAATATGTTGTAACGCTATGTAATTCCACCCTTCATTACTAATAATTAGCCATTCACGTGATCTCA
GCCAGTTGTGGCGCCACACTTTTTTTTCCATAAAAATCCTCGAGGAAAAGAAAAGAAAAAAATATTTCAGTT
ATTTAAAGCATAAGATGCCAGGTAGATGGAACTTGTGCCGTGCCAGATTGAATTTTGAAAGTACAATTGAGG
CCTATACACATAGACATTTGCACCTTATACATATAC

Exercises
Write a script that:
1. Confirms whether the user has input the code in the following format: Classcode_yearcode(papercode), e.g. dt249 4(w203c)
2. Many important DNA sequences have specific patterns, e.g. TATA. Write a script to find the position of this sequence in a FASTA file sequence.

Exercises
3. Write a script that finds the reverse complement of a DNA sequence without using the tr function. (Hint: a global search and replace will give an incorrect answer.)
4. Coding regions begin with the AUG (ATG) codon and end with a stop codon. Write a Perl script that extracts a coding sequence from a FASTA file.

Exercise
5. Modify the Sequence size example from earlier to:
– Allow the user to input a file name and determine its length.

Exam Questions
– Perl is an important bioinformatics language. Explain the main features of Perl that make it suitable for bioinformatics. (10 marks)
– Write a Perl script that illustrates its pattern matching, extraction and substitution abilities. (6 marks)
(Refer to the assignment and previous papers' Perl scripts.)